can vertica HDFS connector work with file format other than text file?

yan_fei · August 2014

I can use COPY statement to pull data from Hadoop to vertica using Vertica HDFS connector, but file format i work with is delimited text file so far, I am wondering if there is possibility to pull data from HDFS which is stored as sequence file or parquet file, if it can, how should I do it?

[Deleted User] · August 2014

Hi Yan,

Vertica provides built-in support for delimited and fixed-width text files.

Vertica's Flex package (bundled with Vertica 7+) provides support for JSON files. As of Vertica 7.1, it supports CEF files; it also has a "Regex" parser, which uses an arbitrary regular expression (provided as part of the query string) to parse a file. Flex's documentation is here:

http://my.vertica.com/docs/7.1.x/PDF/HP_Vertica_7.1.x_Flex_Tables.pdf

(Scroll to page 27 to see the section on loading data in different formats.)

Vertica also provides an API for writing user-defined parsers. These, like the Flex parsers above, can be used for loading data from any source, including Hadoop. Several such parsers have been published to the Vertica Marketplace:

http://vertica.com/marketplace

(Take a look in the "ETL and Data Ingest" section.) There is an Avro parser in this section. Also various other connectors that load data directly from other APIs that provide data via function calls so don't need parsing in the traditional sense. There's also the "Shell Load Package"; you could use its ExternalFilter function to run data through an arbitrary external script which can convert it into a format that Vertica has a parser for.

To answer your specific question -- no, Vertica unfortunately does not provide a built-in parser for sequence or parquet files in particular. (Though many Hadoop users seem to be able to make do with Avro or JSON, which we do provide parsers for.) But you are welcome to write one if you'd like. You can use the API described above. It's common to just go and find some third-party library that can parse the specified file format, then write a thin wrapper to teach Vertica to call that library. Examples are located in "/opt/vertica/sdk/examples/ParserFunctions/" (C++) and "/opt/vertica/sdk/examples/JavaUDx/UDLFunctions/*" (Java) on any machine with Vertica installed. Also on our github site: (Pull requests welcome!)

http://github.com/vertica

Adam

yan_fei · August 2014

Thanks for quick reply, adam, will check it out based on your suggestion, also I just tried to use pig and ParquetLoader , something like B = LOAD '/tmp/myfile' USING parquet.pig.ParquetLoader; when I dump it, the data looks correct, but when i pulled into vertica like below
STORE B INTO '{schema.mytable(c1 int,c2 varchar(4000),c3 int, c4 varchar(4000))}' USING com.vertica.pig.VerticaStorer('vertica01,vertica02,vertica03','mydb','5433','user','pwd');
the String/Varchar field looks wrong, for example, say C2 is 'V100x2t1' in hadoop, in Vertica , it shows '5631303078327431' - so is this because Vertica Storer doesn't understand Parquet? Is there a way to do it?

[Deleted User] · August 2014

Hi Yan,

Hm... Unfortunately, I have no idea why it's modifying the value this way.

I'm not personally super-familiar with Pig. But, does this happen with every input string? (I'm wondering if something somewhere is seeing the "0x..." in this example and thinking for some reason that it's a hex string, for example.) Does this happen if you store the same values into a JSON file and then try to load the JSON file? (If so, if you could come up with a really simple single line or two of JSON that repeats this issue, that would be really helpful in tracking it down.)

Thanks,
Adam

We're Moving!

Create My New Community Account Now

can vertica HDFS connector work with file format other than text file?

Comments

Leave a Comment