Data Load limitation
Is there a limit to the number of concurrent data-loading sessions that can run in parallel on a given Vertica cluster?
On our 7-node, 32-core system, we see that at most 30-32 sessions run in parallel and anything more will wait. Is this related to the number of cores per node, or something else?
If so, then having a 7-node or a 17-node cluster would make no difference in throughput. Is that correct?
We are trying to size a system based on the fact that we will need to load approximately 2 GB/sec.
Comments
Concurrency in Vertica is governed by resource pools.
Each COPY statement requires some system resources on all Vertica nodes, assuming it is loading data into a table that is segmented (or replicated) across all nodes. By default, each COPY will reserve (in particular) a relatively large amount of RAM, to accelerate the process of sorting data. You can modify your resource-pool settings to reduce the RAM allocated to each query. This will allow you to run more COPY statements, though each COPY will run more slowly.
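For illustration only, here is a minimal sketch of the kind of resource-pool change described above; the pool name, memory size, and concurrency values are placeholder assumptions, not tuned recommendations for your cluster:
-- Raise planned concurrency on the general pool; each query's memory budget is roughly the pool's memory divided by PLANNEDCONCURRENCY, so more concurrent COPYs each get less RAM.
ALTER RESOURCE POOL general PLANNEDCONCURRENCY 64;
-- Or create a dedicated pool for loads and run the COPY sessions under it:
CREATE RESOURCE POOL load_pool MEMORYSIZE '8G' PLANNEDCONCURRENCY 16;
SET SESSION RESOURCE_POOL = load_pool;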
For what it's worth -- for that data volume, depending heavily on your hardware and your projection layout, I would expect you to see optimal performance with around 4 or so COPY statements. (For smaller load rates, a single statement is generally sufficient and near-optimal.) 32 is far more than I would expect you to need. The trick is to list a large number of files/inputs in each COPY statement. At least for local data sources, we will load the files in parallel. (The JDBC standard unfortunately doesn't provide for the notion of multiple data streams, so it does not perform as well.)
A single statement with many concurrent files is more efficient than many statements with one or a few files each because, with many statements, we have to preserve the ability to roll back just one COPY if it errors out, while allowing all the other concurrent COPYs to proceed without interference. This means we have to set up more isolation between statements, which requires more system resources, and so on.
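As a hedged sketch of what that looks like in practice (the table name and file paths below are hypothetical): a single COPY can be given a comma-separated list of files, or a glob that expands to many files, and Vertica will parse them in parallel:
COPY my_fact FROM '/data/load/part_001.txt', '/data/load/part_002.txt', '/data/load/part_003.txt' DELIMITER E'\t';
-- or, equivalently, one glob covering all the parts:
COPY my_fact FROM '/data/load/part_*.txt' DELIMITER E'\t';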
(So why is a single statement with a huge list of files always fastest? Honestly, because our code isn't perfect :-) This has been improving over time; newer versions of Vertica need fewer separate COPY statements in order to achieve ideal performance. For smaller clusters, a single COPY statement is often sufficient.)
Incidentally -- 2GB/sec is a relatively high load rate. That's filling several hard disks per hour; several PB of data per year. We certainly have customers who load that and more. But I would strongly encourage you to get in touch with our sales engineers. (If you aren't already in touch, let me know; I can ask them to reach out to you.) Scaling to those load rates, depending on your infrastructure, can put substantial stress on the Linux kernel/drivers and network hardware that you are using. We have a team of engineers here that has substantial experience working through those sorts of infrastructure issues; they can probably give you some good advice.
Adam
Thanks Adam. Appreciate the feedback.
"A single statement with many concurrent files is more efficient than many statements with one or a few files, I would expect you to see optimal performance with around 4 or so COPY statements"
To that point, I ran some tests; the results are as follows:
1) 500 MB * 4 files with 32 COPY took 12 seconds to load 16 GB ( ~ 1.3 GB/sec throughput) WOSDATA=20GB
2) 250 MB * 4 files with 32 COPY took 7 seconds to load 7.8 GB (~ 1.1 GB/sec throughput) WOSDATA=10GB
3) 10 MB * 800 files with 4 COPY took 44 seconds to load 8.5 GB (~ 197 MB/sec throughput) WOSDATA=10GB
This seems to indicate otherwise, or am I missing something?
Also, will adding more nodes allow the COPY statements to finish more quickly, since there are more resources?
If so, is there a per-node scaling factor we can arrive at?
How would one size the cluster (number of nodes required) to achieve a specific load rate?
Hm... Could you post the exact COPY statements that you are running? (Feel free to obfuscate the filenames; I'm interested in the exact syntax/arguments to COPY that you're using.)
For adding more nodes: So, that's a little bit tricky to figure out.
COPY has two phases. The first phase parses your data (locally on the node where the file is stored), then segments it and ships each row out to the node or nodes where that particular row should be stored, based on the segmentation clauses on each projection in the destination table. (Note that your projection design has a substantial impact on load performance -- each new projection means sending out an extra copy of the data at this stage.)
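For illustration, a minimal sketch of the kind of segmentation clause being referred to; the table, column, and projection names are hypothetical:
CREATE PROJECTION my_fact_p AS SELECT * FROM my_fact ORDER BY event_ts SEGMENTED BY HASH(event_id) ALL NODES;
-- Each row is hashed on event_id and shipped to the node that owns that hash value; every additional projection means another copy of each row crossing the network during this first phase.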
The second phase sorts the data and writes it to disk. (If you load into WOS, this cost is deferred until the next moveout operation. That doesn't mean that it doesn't need to happen; it just means that it's more complicated to benchmark streaming loads.)
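If you want to observe that deferred cost while benchmarking, one way (just a suggestion) is to watch the standard system tables for load streams and Tuple Mover activity:
SELECT * FROM load_streams;
SELECT * FROM tuple_mover_operations;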
The first phase is usually limited by CPU on the nodes that you are using to parse data. The first step there is to make sure you're using lots of nodes to parse data. After that, adding more nodes can help. The core of the parsing phase should scale linearly; there's no interaction between independent files being parsed on separate nodes.
But it can also be limited by available network bandwidth, depending on how many times your data is replicated (how many projections you have, segmented vs unsegmented, etc) and on the structure of your network. That's probably not the case for your tests here, but it usually becomes the bottleneck on large clusters or clusters with not-very-fast network layouts. The network is what makes computing a scaling factor hard, for high data rates -- you may not have enough bandwidth; and even if you do, some network setups don't behave as you might like as they approach their maximum load. (2 GB/sec is 16 Gbps, bits vs bytes; that load rate exceeds what a "cheap" 10 Gbps network will be able to handle, so it will require some provisioning thought for higher load rates.)
The second phase is typically I/O-bound, and made slower by having less RAM. With unsegmented projections, all nodes have to do the same amount of work, so adding more nodes doesn't help. But when loading into segmented projections, each node stores, and has to sort, 1/#nodes of the data (actually, #projections/#nodes of the data), so this phase will scale with more machines. Note that it won't necessarily scale linearly: Sorting is not a linear algorithm; even the best sorting algorithms will take more than twice as long to sort twice the data, so will improve by less than 2x if you cut the data size in half. Vertica in particular has the concept of a "multi-phase mergeout", if you load so much data that we have to split the sort into so many temp files that we can't even open them all at once. You're certainly not hitting that in these tests, but it's something to be aware of. Again, our sales/field engineers can help you with this; if you can avoid multi-phase mergeout, this part is usually close enough to linear for a good approximate measurement.
Adam
The COPY commands were run from a shell script.
For tests 1 (500 MB) and 2 (250 MB):
This statement was executed via 32 parallel sessions:
COPY fact_raw_flow_dbops FROM LOCAL '/tmp/test/fact_raw_flow_1.txt' DELIMITER E'\t' NO ESCAPE NULL AS 'NULLNULLNULL';
For test 3:
COPY fact_raw_flow_dbops FROM LOCAL '/tmp/test/1*' DELIMITER E'\t' NO ESCAPE NULL AS 'NULLNULLNULL';
Where the "1*" translated to 800 * 10 MB files.
Note that the files in all test cases were located on the first node.
Please let us know if there is an alternate approach to increase performance while keeping the COPY statement count low (1-4).
COPY LOCAL unfortunately doesn't parallelize as well as regular COPY at this time. Try just omitting the "LOCAL" keyword (otherwise exactly the same query). You should see substantially different performance.
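In other words, a sketch of your statement with only the LOCAL keyword removed (without LOCAL, the path is read directly by the server on the initiator node, so the file must exist there rather than on the client):
COPY fact_raw_flow_dbops FROM '/tmp/test/fact_raw_flow_1.txt' DELIMITER E'\t' NO ESCAPE NULL AS 'NULLNULLNULL';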
You will see better performance if you have the files distributed to more than one node. You can load data located on multiple nodes with a single COPY statement, with syntax like:
COPY fact_raw_flow_dbops FROM '/tmp/test/1*' ON node1, '/tmp/test/2*' ON node2, '/tmp/test/3*' ON node3 DELIMITER E'\t' NO ESCAPE NULL AS 'NULLNULLNULL';
(You will of course have to adjust the filenames and paths; also, use your actual node names rather than "node1", "node2", etc.)
Adam
Hello Adam, here is the result of the test (test 4) based on your inputs. While it does perform better than test 3, it seems 32 COPY statements still perform much better:
4) 500 MB * 4 files, each set located on all 7 nodes with 1 COPY took 20 seconds to load 16 GB (~ 819 MB/sec throughput) WOSDATA=10GB
dbadmin=> COPY fact_raw_flow_dbops FROM
dbadmin-> '/tmp/test/fact_raw_flow_*' ON v_node0001,
dbadmin-> '/tmp/test/fact_raw_flow_*' ON v_node0002,
dbadmin-> '/tmp/test/fact_raw_flow_*' ON v_node0003,
dbadmin-> '/tmp/test/fact_raw_flow_*' ON v_node0004,
dbadmin-> '/tmp/test/fact_raw_flow_*' ON v_node0005,
dbadmin-> '/tmp/test/fact_raw_flow_*' ON v_node0006,
dbadmin-> '/tmp/test/fact_raw_flow_*' ON v_node0007 DELIMITER E'\t' NO ESCAPE NULL AS 'NULLNULLNULL';
Rows Loaded
-------------
12410832
(1 row)
Time: First fetch (1 row): 20089.606 ms. All rows formatted: 20089.659 ms
It looks like you changed the WOSDATA size. It's 10 GB in query 4 but 20 GB in query 1. With a 16 GB dataset, that difference could matter, depending on storage formats, etc.
If you are doing performance testing, are you running commands to flush caches and buffers (Vertica and OS) between tests, so that consecutive queries don't interfere with each other?
Also, the recommendation here is one COPY with *lots of files*. The recommended number depends on which Vertica version you are testing and on what configuration options and resource-pool settings you have set; recent versions of Vertica are capable of using multiple CPU cores while parsing each file if given enough RAM, while older versions are not. But 32 files *per node* per COPY (so, one file per CPU core in your cluster) would be an interesting arrangement to test.
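A hedged sketch of that arrangement, reusing your earlier syntax (the glob paths are placeholders and each should expand to 32 files stored locally on that node):
COPY fact_raw_flow_dbops FROM
'/tmp/test/batch_*' ON v_node0001,
'/tmp/test/batch_*' ON v_node0002,
-- ... and so on for the remaining nodes ...
'/tmp/test/batch_*' ON v_node0007
DELIMITER E'\t' NO ESCAPE NULL AS 'NULLNULLNULL';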
Getting meaningful benchmark numbers for any complex system is surprisingly difficult. I would again strongly encourage you to get in touch with your company's Vertica rep to ask about this stuff. If you can arrange this, I think a quick phone call with an appropriate person on our end would help you sort this stuff out much more quickly than a long forum back-and-forth.
Adam
Ah, that was a typo. The test was indeed run with the exact same WOS setting as the previous best, so it was WOSDATA = 20 GB.
We are simulating a production load for sizing purposes, so I am not flushing caches at the OS level, as that would not be practical in reality.
FYI, We are using Vertica version 7.0.1.4.
I have logged a ticket with Vertica support to get further assistance.
Thanks again for the help on this issue. Much appreciated!
You mention the JDBC data load limitation with regard to the notion of multiple data streams. We have a customer use case where we need to load 6 TB per day (around 70 MB/sec), and today we are using parallel batch inserts (16 threads), but we are not able to achieve anything close to that.
It looks like JDBC batch insert has a scaling limitation. Can you advise what can be done?
Thanks.