Fastest way to COPY data into Vertica cluster over the network?

May_1 · March 2014

We need to load data from files into a Vertica cluster over the network. What would be the best way of doing it?
1. Is there a difference in loading performance between COPY and COPY LOCAL?
2. If we use COPY, does it matter which node/host we copy the files to?
3. Would it help performance if we distributed the files across different nodes in the cluster?

[Deleted User] · March 2014

Hi May,

COPY LOCAL loads a single file at a time. COPY can parse files in parallel, which can yield a significant performance improvement.

If you place your files on multiple different computers, COPY (and not COPY LOCAL) can parse them in parallel. Then again, this means shuffling the data around in advance, which (depending on your setup) may take longer and negate some of the benefits that you would see from a faster COPY.

COPY (and COPY LOCAL) performs a number of operations on loaded data. First, it parses the data; it then segments it, partitions it, sorts it... Everything after parsing happens in parallel no matter what. But parsing is a sizeable chunk of work; parallelizing it can have a significant impact on performance.

Adam

May_1 · March 2014

Thanks, Adam!

If I place the files on multiple computers, what would the COPY command look like? How do I tell Vertica to COPY file1, file4 and file7 from host1, file2, file5 and file8 from host2, and file3, file6 and file9 from host3?
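
For reference, here is a minimal sketch of how such a statement could be assembled. It assumes Vertica's COPY syntax of attaching an `ON nodename` clause to each file path in the FROM list. The table name, file paths, delimiter, and node names below are hypothetical placeholders. Note that the ON clause expects Vertica node names, which may differ from host names, and each file must already exist at that path on the named node.

```python
def build_copy_statement(table, files_by_node, delimiter="|"):
    """Build a COPY statement that pairs each file with the node holding it.

    files_by_node maps a Vertica node name to the list of file paths
    stored on that node. All names here are illustrative assumptions.
    """
    sources = []
    for node, paths in files_by_node.items():
        for path in paths:
            # Double any single quotes so the path stays a valid SQL literal.
            quoted = path.replace("'", "''")
            sources.append(f"'{quoted}' ON {node}")
    if not sources:
        raise ValueError("no input files given")
    source_list = ",\n     ".join(sources)
    return f"COPY {table}\nFROM {source_list}\nDELIMITER '{delimiter}';"


# Hypothetical layout matching the question: three files per node.
files_by_node = {
    "v_mydb_node0001": ["/data/file1", "/data/file4", "/data/file7"],
    "v_mydb_node0002": ["/data/file2", "/data/file5", "/data/file8"],
    "v_mydb_node0003": ["/data/file3", "/data/file6", "/data/file9"],
}

statement = build_copy_statement("my_table", files_by_node)
print(statement)
```

The generated text can then be run through vsql or any client connected to the cluster. Whether this layout actually loads faster than a single-node COPY depends on how long it takes to distribute the files in the first place, as noted above.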
