HP Vertica Parallel Export - Is it recommended?

David_Chen · November 2016

Hi,

Our goal is to export a massive table (on the order of TB bytes stored) to CSV(s), to transfer it elsewhere for archival. I've been searching the HP Vertica site, but have not found what I am looking for. Much of the documentation assumes that the user wants to export from 1 Vertica cluster to another Vertica cluster.

However, many non-HP sites reference the "parallel export" UDF here:

https://marketplace.saas.hpe.com/big-data/content/parallel-export

Is this a good option for what I'm looking to do? Or is the Parallel Export UDF deprecated, and thus not actively promoted on the HP site.

- David

David_Chen · November 2016

Looks like the larger file (391.8 KB) on https://marketplace.saas.hpe.com/big-data/content/parallel-export does not build - due to multiple definitions of a function: Basics::BigInt::binaryToDecScale(void const*, int, char*, int, int)'

/tmp/ccptMaF0.o: In function `Basics::BigInt::binaryToDecScale(void const*, int, char*, int, int)':
/opt/vertica/sdk/include/Vertica.cpp:552: multiple definition of `Basics::BigInt::binaryToDecScale(void const*, int, char*, int, int)'

The smaller file looks to be the one to download, which does build successfully.

David_Chen · January 2017

After trying out Parallel Export, I would have to say "yes, with caveats".

The upside: You get the fastest possible way to export a table, with great flexibility. You avoid the inter-node network traffic that slows down a general purpose vsql select to file.

The downsides:

1. Slows down other clients seeking Vertica connections. If you're exporting a huge table, break it up into smaller pieces, and export those tables, so it doesn't interfere with existing processes.

2. if you have a custom command ("cmd") which writes to custom files, like sed, gzip, or anything, be sure to use $$ to indicate PID (process ID). Otherwise, the processes will collide with each other and overwrite each other's files.

David_Chen · January 2017

Sample Code:

\timing
SELECT ParallelExport("blah", "ha", "field", "field2" 
USING PARAMETERS separator='|', cmd='sed ''s/\\/\\\\/g'' | sed ''s/__REPLACE__/\\|/g'' | gzip -9 -c - > /some/dir/which/exists/on/all/vertica/machines/test.my_table_here.$$.gz') 
OVER (PARTITION BEST) FROM test.my_table_here;

opper20 · May 2018

Hi, can i use aggregate function on column while exporting.i need to do sum(column) and then export the output.Is it possible?

Jim_Knicely · June 2018

As far as I know that Parallel Export UDX is no longer available.

However, if you have it already, you should be able to export a sum if you use a sub query...

Example:

select ParallelExport(c1_sum)
 using parameters
 path='/tmp/test.csv.${nodeName}', separator=',')
over (partition auto) from (select sum(c1) c1_sum from some_table);

We're Moving!

Create My New Community Account Now

HP Vertica Parallel Export - Is it recommended?

Comments

Leave a Comment