Need to improve performance in COPY
Hi,
We have a daily job that reads data from a .tar file of about 35 GB. The file contains 500-600 .gz files. The script uses a COPY statement to read the data and load it into a staging table:
COPY <> FROM STDIN GZIP DELIMITER ',' DIRECT REJECTMAX 500 NULL AS ''
It takes almost 6-8 hours to process one file. Could you please suggest any options to reduce the processing time?
Comments
Hi!
For 35 GB of compressed data? Hmm... Vertica can do this much faster, so I suspect the problem is in the script. Without the script itself, the COPY command (does it do any complex filtering or parsing?), and the Vertica configuration (how many nodes? how do you know the bottleneck is not in disk/network/CPU?), it's hard to help.

Your problem is "COPY FROM STDIN". This goes through a bottleneck - the network - with a single thread doing the parsing.
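If the load script looks something like the following single-stream pipeline (a guess - the table name, file path, and vsql invocation are placeholders, and this variant decompresses in the shell rather than with the GZIP keyword), then the whole 35 GB funnels through one pipe and one parsing thread:

```shell
# Hypothetical reconstruction of the current job; names and paths are
# placeholders. The point: everything flows through ONE stream.
load_from_stdin() {
  # Stream every .gz member out of the tar, decompress the concatenated
  # gzip streams, and feed one COPY ... FROM STDIN session.
  tar -xOf "$1" | gunzip -c \
    | vsql -c "COPY staging FROM STDIN DELIMITER ',' NULL AS '' DIRECT REJECTMAX 500"
}
# usage: load_from_stdin /data/daily.tar
```

However fast the disks are, that one parser caps the throughput.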
The best approach for a fast load is using a file server, efficiently attached to the platform that produces your zipped load files. Don't leave tar files there; use single files gzipped.
The same file server should be mounted to each of the Vertica nodes, to the same directory name - and have a good network connection, of course.
Then, you can just go:
COPY tb FROM '/shared/mounted/directory/tb*.csv.gz' ON ANY NODE GZIP DELIMITER ',' ENCLOSED BY '"' DIRECT [...]
The effect will be that you get a considerable number of threads parsing your input files in parallel - instead of the single parsing thread you get with FROM STDIN. Some preparation, but worth a try ...
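The preparation step can be sketched like this (paths are examples only): unpack the tar once on the file server so every .gz member becomes its own file, then let ON ANY NODE spread the parsing across the cluster:

```shell
# Sketch of the staging step; tarball and mount paths are examples.
stage_files() {
  local tarball=$1 shared=$2
  mkdir -p "$shared"
  # Extract the tar so each member stays an individual gzipped file;
  # COPY ... ON ANY NODE can then parse many of them in parallel.
  tar -xf "$tarball" -C "$shared"
}
# afterwards, from vsql:
#   COPY tb FROM '/shared/mounted/directory/tb*.csv.gz' ON ANY NODE GZIP
#        DELIMITER ',' DIRECT;
```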
Marco
Hi!
@marcothesane
In general I agree with you, but ... what if it is a single-node cluster, and/or the tar file is on a Vertica cluster node? From my experience, 35 GB in 6-8 hours points more to an IO/CPU bottleneck than to the network, because the minimum requirement for Vertica is a 1 Gbit network. That's why I asked for the physical configuration and the script/COPY itself.

@Vertica_User
How long does it take to unpack the tar file?

Hi Marcothesane,
Thanks for your suggestion.
Are you saying to mount the file server on all the nodes? Will that consume space? Because we have only 20% free space on each node.
@sKwa
It takes about a minute to unpack the tar files.
@Vertica_User It won't use local disk space - it's a network mount. Secondly, Vertica has minimum recommendations for free space on each node; not having enough free space will cause performance problems. Thirdly, what does your table DDL look like?