COPY FROM PARQUET performance
I have a 3-node Vertica Enterprise v25 cluster with 36 physical cores on each node, and I am trying to find the fastest way to export data to Parquet and import it back.
I created 10 identical tables for my tests, and I export them all at the same time like so:
EXPORT TO PARQUET (directory='/data/export/test_table0', compression='zstd') AS SELECT * FROM test_table0;
and once all the exports are finished, I import them like so (all at the same time):
COPY test_table0 FROM '/data/export/test_table0/*.parquet' ON EACH NODE PARQUET (do_soft_schema_match_by_name='True');
I need both ON EACH NODE and do_soft_schema_match_by_name, so they're there for a reason.
My problem is this: by varying the pool's query budget and the number of concurrent commands, I can fully saturate the cluster's resources on EXPORT, reaching up to 100% CPU usage and up to 100% RAM usage.
But whatever I do, I can't seem to get the COPY commands to consume more than 50-55% CPU. It reaches 50% CPU at around 5 COPY commands, and adding more COPY commands in parallel just makes them all finish proportionally later, which makes me think there's room for optimization. Any idea what that could be?
Thank you!
Answers
"adding more COPY commands in parallel just makes them all finish proportionally later": does this mean that the additional COPY commands are executed only after the first 5 copies have completed?
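
One way to check whether the extra COPY statements are queued rather than running is to look at the resource manager's system tables while the load is in progress. The following is only a sketch, assuming the standard v_monitor.resource_queues and v_monitor.resource_pool_status tables are available and readable by your user; it does not explain the 50% CPU ceiling by itself, it only shows whether statements are waiting on the pool.

```sql
-- Run these while the COPY statements are in flight.

-- 1) Statements currently waiting for resources.
--    Any rows here mean some COPY commands are queued, not executing.
SELECT node_name,
       pool_name,
       transaction_id,
       statement_id,
       memory_requested_kb,
       position_in_queue,
       queue_entry_timestamp
FROM v_monitor.resource_queues
ORDER BY queue_entry_timestamp;

-- 2) Per-node concurrency and memory state of each resource pool.
--    Compare running_query_count with the number of COPY commands submitted.
SELECT node_name,
       pool_name,
       running_query_count,
       planned_concurrency,
       max_concurrency,
       execution_parallelism,
       query_budget_kb,
       memory_inuse_kb
FROM v_monitor.resource_pool_status
ORDER BY pool_name, node_name;
```

If the first query returns no rows and running_query_count matches the number of COPY commands you started, the statements are running concurrently and the limit is somewhere other than pool queueing.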