Efficient bzip2 of large data set

I have a large amount of data spread over 12 nodes and wish to export it to a bzip2 file.  This will be a regular daily export, and I would like to minimize network overhead (and distribute the workload) so that the overall bzip2 operation produces the file as quickly as possible with minimal impact on the system.

To that end, I'd like to:

1) have all data compressed into a bzip2 file on the local node where the data resides, and

2) avoid having all of the bzip2 compression work done on a single node, as would happen with the traditional vsql "SELECT * FROM TABLE" | bzip2 > final_file.bz2 approach (roughly the command sketched below).

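For reference, the single-node baseline I want to avoid would look roughly like this (the -At flags for unaligned, tuples-only output and the connection setup are assumptions on my part; the table and file names are placeholders):

    # Everything funnels through the one node running vsql and bzip2.
    vsql -At -c "SELECT * FROM TABLE" | bzip2 > final_file.bz2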

My plan is to set up a script which:

1) forks 12 ssh sessions, one to each of the 12 nodes, each running a vsql command of the form "SELECT * FROM TABLE WHERE <segmentation_clause> = <node_number>" with output piped through bzip2 to create a local file <file_name>.<node_number>.bz2 (see the sketch after this list),

2) scps all 12 .bz2 files to a single node, and

3) concatenates the 12 .bz2 files into one large file, <file_name>.bz2 (bzip2 can decompress a file that is the concatenation of multiple compressed streams, so no recompression should be needed).

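Here is a rough bash sketch of what I have in mind. The hostnames, table name, directories, and vsql flags are placeholders/assumptions to be adapted, and <segmentation_clause> stands for whatever predicate matches the table's segmentation:

    #!/usr/bin/env bash
    # Sketch only: assumes password-less ssh between nodes and that vsql
    # connection settings are already configured on every node.
    set -euo pipefail

    NODES=(node01 node02 node03 node04 node05 node06
           node07 node08 node09 node10 node11 node12)   # placeholder hostnames
    TABLE="my_table"                  # placeholder table name
    OUT_DIR="/data/export"            # staging directory on each node
    GATHER_DIR="/data/export/gather"  # collection directory on this node
    FINAL_FILE="/data/export/my_table.bz2"

    mkdir -p "${GATHER_DIR}"

    # Step 1: each node exports only its own slice and compresses it locally.
    # <segmentation_clause> is a placeholder for the real segmentation predicate.
    for i in "${!NODES[@]}"; do
      ssh "${NODES[$i]}" "vsql -At -c \"SELECT * FROM ${TABLE} WHERE <segmentation_clause> = ${i}\" | bzip2 > ${OUT_DIR}/${TABLE}.${i}.bz2" &
    done
    wait   # block until all 12 exports have finished

    # Step 2: pull the 12 compressed slices onto this node.
    for i in "${!NODES[@]}"; do
      scp "${NODES[$i]}:${OUT_DIR}/${TABLE}.${i}.bz2" "${GATHER_DIR}/" &
    done
    wait

    # Step 3: concatenate the slices in node order; bzip2 decompresses
    # concatenated streams as a single file, so no recompression is needed.
    for i in "${!NODES[@]}"; do
      cat "${GATHER_DIR}/${TABLE}.${i}.bz2"
    done > "${FINAL_FILE}"

As a sanity check afterwards, piping bzcat of the final file to wc -l should match the table's row count.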

I believe this would accomplish my goals, but I want to make sure I'm not overlooking a more direct approach (or solving a performance problem that doesn't actually exist).  I'd appreciate any feedback on that!
