Tuning multiple copy commands in parallel

We have a single-node Vertica setup on which we run multiple COPY load commands in parallel. We have set up a resource pool with a planned concurrency of 15, and we have been firing 15 COPY commands in parallel. A single COPY command executes fast, but as the number of parallel COPY commands increases, the load speed drops. Are there any recommendations for tuning the default settings so that parallel COPY commands run faster?
How many parallel COPY commands (each loading a few million rows) are recommended?

Our use case is very simple:
The service keeps appending events to a CSV file; periodically we roll the file and fire a COPY command on the rolled CSV. Multiple services can run at once, and they usually fire their COPY commands at Vertica at the same time.
Is there a faster approach to loading data into Vertica when we have a continuous stream of incoming events?
One of the columns is varchar(65000) - any tuning recommendation?

Answers

  • What's the Vertica version? Can you check whether EnableApportionLoad is enabled for parallel load streams?

  • mosheg (Vertica Employee, Administrator)

    In addition check the recommendations mentioned here:
    https://forum.vertica.com/discussion/comment/245482#Comment_245482

  • edited September 2020

    @Nimmi_gupta said:
    What's the Vertica version? Can you check whether EnableApportionLoad is enabled for parallel load streams?

    Vertica 9.3. Does it matter for a single-node setup too?

  • @mosheg said:
    In addition check the recommendations mentioned here:
    https://forum.vertica.com/discussion/comment/245482#Comment_245482

    Can I use batch inserts for the continuous stream of incoming data? Will they perform as well as COPY?
    I know a batch insert implicitly uses the COPY command, but I'm not sure about the performance.

  • In a one-node cluster, parallel load performance depends on the number of cores (host processors), memory, and disk space. What is MAXCONCURRENCY set to? MAXCONCURRENCY sets the maximum number of concurrent queries that can run against the pool. PLANNEDCONCURRENCY is an estimate of the number of concurrent queries expected to run against the pool; it is used to calculate the query budget for the resource pool.

  • @Nimmi_gupta said:
    In a one-node cluster, parallel load performance depends on the number of cores (host processors), memory, and disk space. What is MAXCONCURRENCY set to? MAXCONCURRENCY sets the maximum number of concurrent queries that can run against the pool. PLANNEDCONCURRENCY is an estimate of the number of concurrent queries expected to run against the pool; it is used to calculate the query budget for the resource pool.

    Plannedconcurrency - 12, Maxconcurrency - 24
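
To make the PLANNEDCONCURRENCY point above concrete, here is a rough sketch in Python. It assumes a simplification: that a pool's per-query budget is roughly the pool's memory divided by PLANNEDCONCURRENCY. The real calculation also depends on the pool's MEMORYSIZE and MAXMEMORYSIZE settings. The 48 GB pool size and the function name are hypothetical and not taken from this thread.

```python
def approx_query_budget_mb(pool_memory_mb: float, planned_concurrency: int) -> float:
    """Rough per-query memory budget for a resource pool.

    Assumption (not Vertica's exact formula): the budget is the pool's
    memory divided evenly by PLANNEDCONCURRENCY.
    """
    if planned_concurrency < 1:
        raise ValueError("planned_concurrency must be at least 1")
    return pool_memory_mb / planned_concurrency


if __name__ == "__main__":
    pool_memory_mb = 48 * 1024  # hypothetical pool size: 48 GB

    # Compare the planned-concurrency values mentioned in the thread (12, 15)
    # with the pool's MAXCONCURRENCY (24) to see how the budget shrinks.
    for planned in (12, 15, 24):
        budget = approx_query_budget_mb(pool_memory_mb, planned)
        print(f"PLANNEDCONCURRENCY={planned:2d} -> about {budget:.0f} MB per query")
```

Under this approximation, a higher PLANNEDCONCURRENCY means each COPY is planned with less memory, while a lower value gives each load more memory but lets fewer run before queries start queueing. The budget Vertica actually computes can be checked in the query_budget_kb column of the resource_pool_status system table.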
