Efficient bzip2 of large data set

I have a large amount of data spread over 12 nodes and wish to export it to a bzip2 file.  This will be a regular daily export, and I would like to minimize network overhead (and distribute the workload) so that the overall bzip2 operation produces the file as quickly as possible with minimal impact on the system.

To that end, I'd like to:

1) have all data compressed into a bzip2 file on the local node where the data resides, and

2) avoid having all of the bzip2 compression work done on a single node, as would happen with the traditional vsql "SELECT * FROM TABLE" | bzip2 > final_file.bz2 approach (roughly the command sketched below).

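For reference, the single-node baseline I want to avoid would look roughly like this (the -At flags for unaligned, tuples-only output and the connection setup are assumptions on my part; the table and file names are placeholders):

    # Everything funnels through the one node running vsql and bzip2.
    vsql -At -c "SELECT * FROM TABLE" | bzip2 > final_file.bz2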

My plan is to set up a script which:

1) forks 12 ssh sessions, one to each of the 12 nodes, each running a vsql command of the form "SELECT * FROM TABLE WHERE <segmentation_clause> = <node_number>" with output piped through bzip2 to create a local file <file_name>.<node_number>.bz2 (see the sketch after this list),

2) scps all 12 .bz2 files to a single node, and

3) concatenates the 12 .bz2 files into one large file, <file_name>.bz2 (bzip2 can decompress a file that is the concatenation of multiple compressed streams, so no recompression should be needed).

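Here is a rough bash sketch of what I have in mind. The hostnames, table name, directories, and vsql flags are placeholders/assumptions to be adapted, and <segmentation_clause> stands for whatever predicate matches the table's segmentation:

    #!/usr/bin/env bash
    # Sketch only: assumes password-less ssh between nodes and that vsql
    # connection settings are already configured on every node.
    set -euo pipefail

    NODES=(node01 node02 node03 node04 node05 node06
           node07 node08 node09 node10 node11 node12)   # placeholder hostnames
    TABLE="my_table"                  # placeholder table name
    OUT_DIR="/data/export"            # staging directory on each node
    GATHER_DIR="/data/export/gather"  # collection directory on this node
    FINAL_FILE="/data/export/my_table.bz2"

    mkdir -p "${GATHER_DIR}"

    # Step 1: each node exports only its own slice and compresses it locally.
    # <segmentation_clause> is a placeholder for the real segmentation predicate.
    for i in "${!NODES[@]}"; do
      ssh "${NODES[$i]}" "vsql -At -c \"SELECT * FROM ${TABLE} WHERE <segmentation_clause> = ${i}\" | bzip2 > ${OUT_DIR}/${TABLE}.${i}.bz2" &
    done
    wait   # block until all 12 exports have finished

    # Step 2: pull the 12 compressed slices onto this node.
    for i in "${!NODES[@]}"; do
      scp "${NODES[$i]}:${OUT_DIR}/${TABLE}.${i}.bz2" "${GATHER_DIR}/" &
    done
    wait

    # Step 3: concatenate the slices in node order; bzip2 decompresses
    # concatenated streams as a single file, so no recompression is needed.
    for i in "${!NODES[@]}"; do
      cat "${GATHER_DIR}/${TABLE}.${i}.bz2"
    done > "${FINAL_FILE}"

As a sanity check afterwards, piping bzcat of the final file to wc -l should match the table's row count.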

I believe this would accomplish my goals, but I want to make sure I'm not overlooking a more direct approach (or solving a performance problem that doesn't actually exist).  I'd appreciate any feedback on that!
