Need to improve performance in COPY
Hi,
We have a daily job that reads data from a .tar file of about 35 GB. The file contains 500-600 .gz files. The script uses a COPY statement to read the data and load it into a staging table:
COPY <> FROM STDIN GZIP DELIMITER ',' DIRECT REJECTMAX 500 NULL AS ''
It takes almost 6-8 hours to process one file. Could you please suggest any options to reduce the processing time?
Comments
Hi!
For 35 GB of compressed data? Hmm... Vertica can do this much faster, so I suspect the problem is in the script. Without the script itself, the COPY command (does it do any complex filtering or parsing?), and the Vertica configuration (how many nodes? how do you know the bottleneck is not in disk/network/CPU?), it's hard to help.

Your problem is "COPY FROM STDIN". This goes through a bottleneck - the network - with a single thread doing the parsing.
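If the load script looks something like the following single-stream pipeline (a guess - the table name, file path, and vsql invocation are placeholders, and this variant decompresses in the shell rather than with the GZIP keyword), then the whole 35 GB funnels through one pipe and one parsing thread:

```shell
# Hypothetical reconstruction of the current job; names and paths are
# placeholders. The point: everything flows through ONE stream.
load_from_stdin() {
  # Stream every .gz member out of the tar, decompress the concatenated
  # gzip streams, and feed one COPY ... FROM STDIN session.
  tar -xOf "$1" | gunzip -c \
    | vsql -c "COPY staging FROM STDIN DELIMITER ',' NULL AS '' DIRECT REJECTMAX 500"
}
# usage: load_from_stdin /data/daily.tar
```

However fast the disks are, that one parser caps the throughput.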
The best approach for a fast load is using a file server, efficiently attached to the platform that produces your zipped load files. Don't leave tar files there; use single files gzipped.
The same file server should be mounted to each of the Vertica nodes, to the same directory name - and have a good network connection, of course.
Then, you can just go:
COPY tb FROM '/shared/mounted/directory/tb*.csv.gz' ON ANY NODE GZIP DELIMITER ',' ENCLOSED BY '"' DIRECT [...]
The effect will be that you get a considerable number of threads parsing your input files in parallel - instead of the single parsing thread you get with FROM STDIN. Some preparation, but worth a try ...
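The preparation step can be sketched like this (paths are examples only): unpack the tar once on the file server so every .gz member becomes its own file, then let ON ANY NODE spread the parsing across the cluster:

```shell
# Sketch of the staging step; tarball and mount paths are examples.
stage_files() {
  local tarball=$1 shared=$2
  mkdir -p "$shared"
  # Extract the tar so each member stays an individual gzipped file;
  # COPY ... ON ANY NODE can then parse many of them in parallel.
  tar -xf "$tarball" -C "$shared"
}
# afterwards, from vsql:
#   COPY tb FROM '/shared/mounted/directory/tb*.csv.gz' ON ANY NODE GZIP
#        DELIMITER ',' DIRECT;
```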
Marco
Hi!
@marcothesane
In general I agree with you, but ... what if it is a single-node cluster, and/or the tar file is on a Vertica cluster node? From my experience, 35 GB in 6-8 hours points more to an IO/CPU bottleneck than to the network, because the minimum requirement for Vertica is a 1 Gbit network. That's why I asked for the physical configuration and the script/COPY itself.

@Vertica_User
How long does it take to unpack the tar file?

Hi Marcothesane,
Thanks for your suggestion.
Are you saying to mount the file server on all the nodes? Will that consume space? Because we have only 20% free space on each node.
@sKwa
It takes about a minute to unpack the tar files.
@Vertica_User It won't use local disk space - it's a network mount. Secondly, Vertica has minimum recommendations for free space on each node; not having enough free space will cause performance problems. Thirdly, what does your table DDL look like?