COPY FROM PARQUET performance — Vertica Forum

dimitri_p
edited October 9 in General Discussion

I have a 3-node Vertica Enterprise v25 cluster with 36 physical cores per node, and I am trying to find the fastest way to export data to Parquet and import it back.
I created 10 identical tables for my tests, and I export them all at the same time like so:

EXPORT TO PARQUET (directory='/data/export/test_table0', compression='zstd') AS SELECT * FROM test_table0;

Once all the exports have finished, I import them, again all at the same time, like so:

COPY test_table0 FROM '/data/export/test_table0/*.parquet' ON EACH NODE PARQUET (do_soft_schema_match_by_name='True');

I need both ON EACH NODE and do_soft_schema_match_by_name, so they're there for a reason.

My problem is this: by varying the resource pool's query budget and the number of concurrent commands, I can fully saturate the cluster's resources on EXPORT, with up to 100% CPU usage and up to 100% RAM usage.

But whatever I do, I can't get the COPY commands to consume more than 50-55% CPU. CPU usage reaches about 50% at roughly 5 concurrent COPY commands, and adding more COPY commands in parallel just makes them all finish proportionally later, which makes me think there's room for optimization. Any idea what that optimization could be?

Thank you!

Answers

  • SruthiA Administrator

    "adding more COPY commands in parallel just makes them all finish proportionally later": does that mean the additional COPY commands are executed only after the first 5 COPY commands have completed?
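
    One way to tell whether the extra COPY commands are waiting in a queue rather than running is to query Vertica's monitoring tables while the load is in progress. The queries below are only a sketch: they assume the loads run in the built-in general pool, so substitute the name of your own resource pool if it differs.

    ```sql
    -- Statements currently waiting for resources.
    -- Rows here for the COPY transactions suggest they are queued, not running.
    SELECT node_name,
           pool_name,
           transaction_id,
           statement_id,
           memory_requested_kb,
           position_in_queue,
           queue_entry_timestamp
    FROM v_monitor.resource_queues
    ORDER BY queue_entry_timestamp;

    -- Per-node view of the pool: how many statements are running,
    -- the concurrency settings, and the memory budget given to each query.
    -- 'general' is an assumption; replace it with the pool your COPY sessions use.
    SELECT node_name,
           pool_name,
           running_query_count,
           planned_concurrency,
           max_concurrency,
           query_budget_kb,
           memory_inuse_kb
    FROM v_monitor.resource_pool_status
    WHERE pool_name = 'general'
    ORDER BY node_name;
    ```

    If the first query returns no rows while more than 5 COPY commands are active, they are probably all running concurrently and sharing some other limited resource, rather than waiting for the earlier ones to finish.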

This discussion has been closed.