Text indexing for documents stored in Vertica?
I am designing a hybrid system in which we have both text documents and some numerical properties derived from them. The numeric part is stored in Vertica, since we need analytic queries over it. We also need free-form text search for the web interface, à la Lucene or a similar indexer.
Is there some free-form search (i.e. not by regular expression but by sets of words) over text in Vertica? Alternatively, has anyone had good experience with piggybacking an indexing engine on top of Vertica?
- Vertica has a C++ SDK, so you can implement your own indexer: take a look at the Inverted Index example (/opt/vertica/sdk/examples/TransformFunctions/InvertedIndex.cpp). You could also take a free library (CLucene, for example) and use it in your C++ UDFx. I can help with development.
- Vertica also has an R-lang SDK (for text processing in R, see http://www.johnmyleswhite.com/notebook/2009/02/25/text-processing-in-r/)
- You can use Lucene outside Vertica (Vertica has JDBC/HADOOP/HDFS connectors)
P.S. You can't load large texts, since Vertica's strings are limited to 64KB, so using an external tool (maybe even another database) is almost inevitable.
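To make the inverted-index idea concrete, here is a minimal sketch in plain Python. It is not the SDK example and does not use any Vertica API; the names `tokenize`, `build_index`, and `search`, as well as the sample documents, are made up for illustration. It only shows the technique: map each word to the set of documents containing it, then answer a "set of words" query by intersecting those sets.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # .get() avoids creating empty entries for unknown words.
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Hypothetical sample data: document id -> text.
docs = {
    1: "Vertica stores the numeric properties",
    2: "Lucene indexes the text documents",
    3: "Text search over documents in Vertica",
}
index = build_index(docs)
print(sorted(search(index, "text documents")))  # [2, 3]
print(sorted(search(index, "vertica text")))    # [3]
```

In a real setup the document id would be the key you also store in Vertica, so that search hits can be joined back to the numeric data. A proper engine such as Lucene adds ranking, stemming, and on-disk storage on top of this basic structure.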
I'll look into the other resources, thanks!
As for storing compressed, searchable documents: AFAIK Apache Lucene can compress stored documents and, of course, also makes them searchable. Consult its manual for details.