Faster Data Loads with Apportioned Load

Vertica can divide the work of loading data, taking advantage of parallelism to speed up the operation. One supported type of parallelism is called apportioned load.

An apportioned load divides a single large file or other single source into segments (portions), which are assigned to several nodes to be loaded in parallel.

Example:

I want to load a data file that contains 100,000,000 records.

dbadmin=> \! wc -l /home/dbadmin/big_data.txt
100000000 /home/dbadmin/big_data.txt

For my first load attempt, I’ll load the file from a single node in my three-node cluster.

dbadmin=> \timing
Timing is on.

dbadmin=> COPY big_data FROM '/home/dbadmin/big_data.txt' DIRECT;
Rows Loaded
-------------
   100000000
(1 row)

Time: First fetch (1 row): 49078.222 ms. All rows formatted: 49078.268 ms

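Next, I will rerun the load, this time including the “ON ANY NODE” option of the COPY command so that Vertica performs an apportioned load. This option tells Vertica that the file is available at the same path on every node, so make sure that is true before using it.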

dbadmin=> COPY big_data FROM '/home/dbadmin/big_data.txt' ON ANY NODE DIRECT;
Rows Loaded
-------------
   100000000
(1 row)

Time: First fetch (1 row): 21141.006 ms. All rows formatted: 21141.045 ms

Wow! The apportioned load ran more than twice as fast as the single-node load, finishing in about 57% less time:

dbadmin=> SELECT 100 - (21141.006 / 49078.222 * 100) || '%' PCT_FASTER;
     PCT_FASTER
---------------------
56.923855146993700%
(1 row)
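The figure above is the reduction in elapsed time, which is a different number from the speedup ratio. The short Python sketch below derives both from the two timings reported in this post. The variable and function names are my own, chosen only for illustration.

```python
# Elapsed times reported by vsql for the two loads, in milliseconds.
SINGLE_NODE_MS = 49078.222
APPORTIONED_MS = 21141.006


def speedup(baseline_ms: float, improved_ms: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline_ms / improved_ms


def time_saved_pct(baseline_ms: float, improved_ms: float) -> float:
    """Percentage of the baseline elapsed time that was eliminated."""
    return 100 - (improved_ms / baseline_ms * 100)


ratio = speedup(SINGLE_NODE_MS, APPORTIONED_MS)
saved = time_saved_pct(SINGLE_NODE_MS, APPORTIONED_MS)
print(f"speedup: {ratio:.2f}x")       # about 2.32x
print(f"time saved: {saved:.2f}%")    # about 56.92%, matching the query
```

A load that takes 57% less time is roughly 2.3 times as fast, which is why both descriptions of this result are accurate.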

Helpful link:
https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/ExtendingVertica/UDx/UDL/ApportionedLoad.htm

Have fun!
