Tuning multiple copy commands in parallel

rajatpaliwal86 · September 2020

We have a single-node Vertica setup on which we execute multiple COPY load commands in parallel. We have set up a resource pool with a planned concurrency of 15, and we have been firing 15 COPY commands in parallel. A single COPY command executes fast, but as the number of parallel COPY commands increases, the load speed becomes slow. Are there any recommendations for tuning the default settings to make parallel COPY commands run faster?
How many parallel COPY commands (each loading a few million rows) are recommended?

Our use case is very simple.
The service keeps writing events to a CSV file; periodically we roll the file and fire a COPY command against that CSV. We can have multiple services running, and they usually fire COPY commands at Vertica at the same time.
Is there a faster approach to loading data into Vertica when we have a continuous stream of incoming events?
One of the columns is varchar(65000). Is there any tuning recommendation for that?

Nimmi_gupta · September 2020

What's the Vertica version? Can you check whether EnableApportionLoad is enabled for parallel load streams?

mosheg · September 2020

In addition, check the recommendations mentioned here:
https://forum.vertica.com/discussion/comment/245482#Comment_245482

rajatpaliwal86 · September 2020

@Nimmi_gupta said:
What's the Vertica version? Can you check whether EnableApportionLoad is enabled for parallel load streams?

Vertica 9.3. Does it matter for a single-node setup too?

rajatpaliwal86 · September 2020

@mosheg said:
In addition, check the recommendations mentioned here:
https://forum.vertica.com/discussion/comment/245482#Comment_245482

Can I use batch inserts for the continuous stream of incoming data? Will they perform as well as COPY?
I know a batch insert implicitly uses the COPY command, but I am not sure about the performance.

Nimmi_gupta · September 2020

In a one-node cluster, parallel load performance depends on the number of cores (host processors), memory, and disk space. What is MAXCONCURRENCY set to? MAXCONCURRENCY sets the maximum number of concurrent queries that can run against the pool. PLANNEDCONCURRENCY is an estimate of the number of concurrent queries that may run against the pool; it is used to calculate the query budget for a resource pool.
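
To make the query budget point concrete, here is a rough Python sketch of the arithmetic. It assumes the simplified rule that the per-query budget is the memory available to the pool divided by the planned concurrency; the actual calculation in Vertica also depends on how the pool's memory size settings are configured. The 24 GB pool size and the function names are hypothetical, chosen only for illustration.

```python
def query_budget_mb(pool_memory_mb: float, planned_concurrency: int) -> float:
    """Approximate per-query memory budget for a resource pool.

    Simplified assumption: budget = pool memory / planned concurrency.
    """
    if planned_concurrency <= 0:
        raise ValueError("planned_concurrency must be positive")
    return pool_memory_mb / planned_concurrency


def queued_statements(concurrent_statements: int, max_concurrency: int) -> int:
    """Number of statements that must wait once the pool's
    maximum concurrency is reached."""
    return max(0, concurrent_statements - max_concurrency)


if __name__ == "__main__":
    pool_memory_mb = 24 * 1024  # hypothetical pool size, not from this thread
    for planned in (4, 8, 15):
        budget = query_budget_mb(pool_memory_mb, planned)
        print(f"planned concurrency {planned}: about {budget:.0f} MB per COPY")
```

Under this assumption, raising the planned concurrency lowers the memory each COPY is budgeted, which is one reason individual loads can slow down as more of them run side by side.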

rajatpaliwal86 · September 2020

@Nimmi_gupta said:
In a one-node cluster, parallel load performance depends on the number of cores (host processors), memory, and disk space. What is MAXCONCURRENCY set to? MAXCONCURRENCY sets the maximum number of concurrent queries that can run against the pool. PLANNEDCONCURRENCY is an estimate of the number of concurrent queries that may run against the pool; it is used to calculate the query budget for a resource pool.

PLANNEDCONCURRENCY is 12 and MAXCONCURRENCY is 24.
