Text indexing for documents stored in Vertica?

I am designing a hybrid system in which we have both text documents and some numerical properties derived from them. The numeric part is stored in Vertica, since we need analytic queries over it. We also need free-form text search for the web interface, à la Lucene or a similar indexer.
Is there some free-form search (i.e. not by regular expression but by sets of words) over text in Vertica? Alternatively, has anyone had a good experience piggybacking an indexing engine on top of Vertica?

Comments

  • Daniel_Leybovic Registered User
    Hi!
    • Vertica has a C++ SDK, so you can implement your own indexer: take a look at the Inverted Index example (/opt/vertica/sdk/examples/TransformFunctions/InvertedIndex.cpp). You could also find a free library (CLucene, for example) and use it in your C++ UDFx. I can help with development.
    • Vertica also has an R SDK; for text processing in R, see http://www.johnmyleswhite.com/notebook/2009/02/25/text-processing-in-r/
    • You can use Lucene externally (Vertica has JDBC, Hadoop, and HDFS connectors).
    PS
    You can't load large texts, since Vertica's strings are limited to 64KB, so using external tools (maybe even another database) is almost inevitable.
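If you still want to keep longer texts in the database, one workaround is to split each document into pieces that fit the string limit and store one piece per row. A minimal Python sketch of the splitting step follows. The function name `split_utf8` and the default of 65000 bytes are assumptions for illustration, so check the actual limit for your column type and Vertica version.

```python
def split_utf8(text: str, max_bytes: int = 65000) -> list[str]:
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each.

    max_bytes is an assumed column limit and must be at least 4, the
    longest UTF-8 encoding of a single character.
    """
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        # Start a new piece before a character would overflow the limit,
        # so a multi-byte character is never cut in half.
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


pieces = split_utf8("é" * 10, max_bytes=4)  # each "é" takes 2 bytes
print(len(pieces))  # 5
```

Each piece could then be stored with the document identifier and a sequence number, which is enough to reassemble the text in order.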
  • Guy_Wiener Registered User
    I figured I could use Lucene over JDBC, but using an external tool is a second priority, as it interferes with the transactions and workflow. Still, you're right that it may be inevitable if we want to support text documents with no size limit.

    I'll look into the other resources, thanks!
  • May_1 Registered User
    Hi Guy, I was just curious how you store the text files outside of Vertica. We have a similar situation here, but we'd like to store the files with good compression if possible and, as in your requirements, have them searchable from Vertica.
  • Guy_Wiener Registered User
    I didn't figure out how to make them searchable from Vertica. The textbook solution seems to be sharing some identifier between the database tables and an indexer, e.g. Lucene. Implementing an inverted index as a Vertica table should work, but it requires more hacking than I can afford right now.
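To make the inverted-index-as-a-table idea concrete, here is a rough sketch. It uses Python's built-in `sqlite3` purely as a stand-in so that it runs anywhere; it is not Vertica-specific and not tested against Vertica. The table name `postings`, the tokenizer, and the sample documents are all made up for illustration. The `doc_id` column plays the role of the identifier shared with the analytic tables.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(conn, docs):
    """Store one (term, doc_id) row per distinct word of each document."""
    conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER)")
    rows = [(term, doc_id)
            for doc_id, text in docs.items()
            for term in tokenize(text)]
    conn.executemany("INSERT INTO postings VALUES (?, ?)", rows)


def search(conn, words):
    """Return the ids of documents that contain every one of the words."""
    terms = sorted(tokenize(" ".join(words)))
    if not terms:
        return []
    marks = ",".join("?" * len(terms))
    sql = (f"SELECT doc_id FROM postings WHERE term IN ({marks}) "
           "GROUP BY doc_id HAVING COUNT(DISTINCT term) = ? "
           "ORDER BY doc_id")
    return [row[0] for row in conn.execute(sql, terms + [len(terms)])]


conn = sqlite3.connect(":memory:")
build_index(conn, {
    1: "Vertica stores numeric data",
    2: "Lucene indexes text data",
    3: "Text search over Vertica data",
})
print(search(conn, ["text", "data"]))  # [2, 3]
```

The matching ids can then be joined against the numeric tables. This only covers exact word matching; ranking, stemming and phrase queries are what a real indexer such as Lucene adds on top.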

    As for storing compressed searchable documents, AFAIK Apache's Lucene can compress stored documents, and obviously also make them searchable. Consult its manual for details.
