HP Vertica Parallel Export - Is it recommended?

 

Hi,

 

Our goal is to export a massive table (on the order of TB bytes stored) to CSV(s), to transfer it elsewhere for archival.  I've been searching the HP Vertica site, but have not found what I am looking for.  Much of the documentation assumes that the user wants to export from 1 Vertica cluster to another Vertica cluster.

 

However, many non-HP sites reference the "parallel export" UDF here:

 

https://marketplace.saas.hpe.com/big-data/content/parallel-export

 

Is this a good option for what I'm looking to do?  Or is the Parallel Export UDF deprecated, and thus not actively promoted on the HP site.

 

- David

 

 

 

 

Comments

  • David_ChenDavid_Chen Registered User

     

    Looks like the larger file (391.8 KB) on https://marketplace.saas.hpe.com/big-data/content/parallel-export does not build - due to multiple definitions of a function: Basics::BigInt::binaryToDecScale(void const*, int, char*, int, int)'

     

    /tmp/ccptMaF0.o: In function `Basics::BigInt::binaryToDecScale(void const*, int, char*, int, int)':
    /opt/vertica/sdk/include/Vertica.cpp:552: multiple definition of `Basics::BigInt::binaryToDecScale(void const*, int, char*, int, int)'

     

    The smaller file looks to be the one to download, which does build successfully.

     

     

  • David_ChenDavid_Chen Registered User

     

    After trying out Parallel Export, I would have to say "yes, with caveats".

     

    The upside: You get the fastest possible way to export a table, with great flexibility.  You avoid the inter-node network traffic that slows down a general purpose vsql select to file.

     

    The downsides:

     

    1. Slows down other clients seeking Vertica connections.  If you're exporting a huge table, break it up into smaller pieces, and export those tables, so it doesn't interfere with existing processes.

     

    2. if you have a custom command ("cmd") which writes to custom files, like sed, gzip, or anything, be sure to use $$ to indicate PID (process ID).  Otherwise, the processes will collide with each other and overwrite each other's files.

     

     

  • David_ChenDavid_Chen Registered User

    Sample Code:

     

    \timing
    SELECT ParallelExport("blah", "ha", "field", "field2"
    USING PARAMETERS separator='|', cmd='sed ''s/\\/\\\\/g'' | sed ''s/__REPLACE__/\\|/g'' | gzip -9 -c - > /some/dir/which/exists/on/all/vertica/machines/test.my_table_here.$$.gz')
    OVER (PARTITION BEST) FROM test.my_table_here;
  • opper20opper20 Registered User

    Hi, can i use aggregate function on column while exporting.i need to do sum(column) and then export the output.Is it possible?

  • Jim_KnicelyJim_Knicely Employee, Registered User, VerticaExpert
    edited June 10

    As far as I know that Parallel Export UDX is no longer available.

    However, if you have it already, you should be able to export a sum if you use a sub query...

    Example:

    select ParallelExport(c1_sum)
     using parameters
     path='/tmp/test.csv.${nodeName}', separator=',')
    over (partition auto) from (select sum(c1) c1_sum from some_table);
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file