Bulk load data into Vertica from HDFS
Hello,
I have a CSV file residing on HDFS, and I want to bulk load this file into Vertica.
I am reading this PDF:
http://www.vertica.com/wp-content/uploads/2011/01/FastDataLoadingInVertica.pdf
but it does not cover fast loading of data from HDFS into Vertica.
What is the best way to load HDFS data into Vertica?
Edit:
I did see this thread:
https://community.dev.hpe.com/t5/Vertica-Forum/load-from-hadoop/m-p/219695/highlight/true#M7434
But I am not loading from Hive into Vertica. I want to keep my data as a CSV in HDFS and then load it into Vertica from there.
Edit:
Also, how does this approach compare to a tool called Sqoop? Does it make sense to use Sqoop for my data load instead?
Comments
The HDFS connector should work well for you. Here is the relevant documentation for Vertica 7.2 (currently the latest release): https://my.vertica.com/docs/7.2.x/HTML/Content/Authoring/HadoopIntegrationGuide/HDFSConnector/LoadingDataFromHDFS.htm. The HDFS connector has existed for many releases now, so check the documentation for whichever version of Vertica you are running.
In a nutshell, the HDFS connector lets Vertica read bytes from HDFS as a source of data for COPY statements or external tables (yes, you can query files that reside in HDFS without loading that data into Vertica). You can use a built-in or user-defined parser to turn those bytes into tuples, and you can use zero or more built-in or user-defined filters to uncompress or otherwise transform the bytes between the source and the parser. In your case the data is CSV, so you should be able to use Vertica's built-in delimited parser.
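Roughly, it looks something like the sketch below (the host name, WebHDFS port, HDFS path, Hadoop user, and table definitions are placeholders for illustration; check the doc linked above for the exact syntax in your version):

-- Bulk load the CSV through the HDFS connector's Hdfs source.
-- The URL points at the namenode's WebHDFS endpoint, and DELIMITER overrides
-- Vertica's default '|' because the file is comma-separated.
COPY sales
   SOURCE Hdfs(url='http://namenode.example.com:50070/webhdfs/v1/user/hadoopUser/data/sales.csv',
               username='hadoopUser')
   DELIMITER ','
   DIRECT;

-- Or leave the CSV in HDFS and parse it at query time via an external table.
CREATE EXTERNAL TABLE sales_ext (id INT, item VARCHAR(50), amount FLOAT)
   AS COPY SOURCE Hdfs(url='http://namenode.example.com:50070/webhdfs/v1/user/hadoopUser/data/sales.csv',
                       username='hadoopUser')
   DELIMITER ',';

DIRECT tells Vertica to write the load straight to ROS, which is usually what you want for a one-off bulk load of a large file.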