Bulk loading slow when source file has about 3 million rows
raulk89
Community Edition User ✭
I am using:
vertica-10.0.0-0.x86_64
CentOS Linux release 7.8.2003 (Core)
4x CPU
16 GB memory
I have checked, and CPU and memory do not seem to be an issue (more than half the memory is always free).
Vertica runs on 1 node and storage is on ext4 filesystem.
Source csv file is on localhost.
-- This loads 2 million rows to the table, and it takes 40 seconds
vsql> COPY STG_SCHEMA.tab1 (col1, col2, col3, ..., colN) FROM '/home/dbadmin/test_2m.csv' WITH DELIMITER AS E'\t' SKIP 1;
-- Now when bulk loading 3 million rows, it takes ~11 minutes.
vsql> COPY STG_SCHEMA.tab1 (col1, col2, col3, ..., colN) FROM '/home/dbadmin/test_3m.csv' WITH DELIMITER AS E'\t' SKIP 1;
-- 5 million rows takes ~20 minutes
Any suggestions as to why it takes so much time?
Answers
Can you share the DDL of the tables? Row count alone does not tell the whole story: row count, column count, column types, and column lengths all influence performance. The number of projections for each table could also play a role.
These CSV file sizes are as follows:
Table has 41 columns.
As Marco mentioned above, I think your column lengths (varchar 65k) probably have the biggest influence.
Please consider the following:
1. If you are loading a table with many columns, including long varchars, change the LoadMergeChunkSizeK config parameter.
It turns out to have an impact on copy/load performance during the sorting phase.
Change the LoadMergeChunkSizeK parameter as an exception for specific wide-table loads.
The default value (2048) may be too small for tables with a large number of columns and lots of varchar columns.
Do one test with 20480 --> SELECT SET_CONFIG_PARAMETER('LoadMergeChunkSizeK',20480);
And another with 204800 --> SELECT SET_CONFIG_PARAMETER('LoadMergeChunkSizeK',204800);
2. One of the reasons for a slow load is the number of columns in the projection ORDER BY clause.
Recreate your schema with NO "SEGMENTED BY" clause and with only one INT field in the ORDER BY clause, because by default Vertica creates a projection with many fields in the sort order.
3. Split the 3M file and measure the time it takes to run two separate COPY commands in parallel, each loading a different file (a sketch follows at the end of this post).
4. Do you load many empty values in the wide varchar columns?
If yes, consider removing all whitespace (denoted by [:space:] in tr):
cat file_name | tr -d '[:space:]'
5. Verify that the EnableCooperativeParse configuration parameter is set to 1:
select * from vs_configuration_parameters where parameter_name = 'EnableCooperativeParse';
More info on how to distribute the load can be found here:
https://www.vertica.com/docs/latest/HTML/Content/Authoring/DataLoad/UsingParallelLoadStreams.htm
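One way to get parallel load streams is a single COPY with multiple sources, as the link above describes. A rough sketch (the part-file names test_3m_part1.csv and test_3m_part2.csv are made up here, and SKIP 1 assumes each part still carries its own header row):
vsql> COPY STG_SCHEMA.tab1 (col1, col2, col3, ..., colN)
      FROM '/home/dbadmin/test_3m_part1.csv', '/home/dbadmin/test_3m_part2.csv'
      WITH DELIMITER AS E'\t' SKIP 1;
Alternatively, open two vsql sessions and run one COPY per part file in each.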
Thanks, I will try it out later on.
Even if my varchar data is at most 1000 characters long, does the declared column size still matter? (I.e., does it matter even when the actual data is 1000 chars rather than 50k chars?)
But I tried the first suggestion, the LoadMergeChunkSizeK parameter, and I can already see major improvements.
What is your recommendation regarding the LoadMergeChunkSizeK parameter: should I increase it even more, and if so, what is the upper limit I should not exceed?
And what would be the downside of increasing this parameter? (I mean, will some other operations perform poorly because this parameter has a bigger value?)
Also, have I understood correctly that WOS was removed in 10.0 and all data is written to ROS instead?
So the DIRECT keyword does not have any impact anymore? (While testing, I got the same results with or without the DIRECT keyword.)
Thanks
Raul
For best performance, try to use data types that reflect your real data lengths.
For the most common use cases, the default settings provide a good balance between resource usage and load elapsed time.
In rare cases, when many very wide fields are required, measure the time it takes and select the setting that best fits your needs.
More info about load tuning can be found here:
https://www.vertica.com/kb/Data-Loading-in-Vertica-Using-COPY/Content/BestPractices/Data-Loading-in-Vertica-Using-COPY.htm
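When comparing settings, you can check the current value and set it back to the 2048 default after your tests. A quick sketch:
SELECT * FROM vs_configuration_parameters WHERE parameter_name = 'LoadMergeChunkSizeK';
SELECT SET_CONFIG_PARAMETER('LoadMergeChunkSizeK', 2048);  -- restore the default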
Yes, in Vertica 10.0, WOS (Write Optimized Store) was removed, because many of the original system limitations that led to the creation of WOS no longer exist. Prior to Vertica 9.3, by default, Vertica initially loaded data into WOS.
For databases created in version 9.3 and later, Vertica now uses a default load type of DIRECT.
This setting loads data directly to ROS, bypassing WOS.
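In other words, a load like your earlier one should behave the same on 10.0 whether or not you append the DIRECT keyword (sketch below, reusing your file and column list):
vsql> COPY STG_SCHEMA.tab1 (col1, col2, col3, ..., colN) FROM '/home/dbadmin/test_3m.csv' WITH DELIMITER AS E'\t' SKIP 1 DIRECT;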
So should I increase the LoadMergeChunkSizeK value only when loading large tables?
And when loading smaller tables, should I set it back to the default, which is 2048?
Or what is the downside of leaving this parameter at 20480 or 204800 (even for smaller tables)?
This "SEGMENT BY" comes for "create projection" statement correct..?
I actually did not create projection manually, I just created staging table, and by that the projection was created by default.
SELECT EXPORT_OBJECTS('','STG_SCHEMA.tab1',FALSE);
And the output for the projection is as follows (I did some replacing to improve readability).
I mean, how can I create the desired projection, and what would it look like?
(At the moment the CREATE TABLE statement created this projection automatically.)
Thanks
Raul
The hash‑segmentation‑clause specifies how to segment and distribute the data evenly across cluster nodes.
It doesn't have much influence in your case when there is just one node in the cluster.
For more info on DBD see: https://www.vertica.com/docs/latest/HTML/Content/Authoring/GettingStartedGuide/UsingDatabaseDeisgner/UsingDatabaseDesignerto.htm
Before, I created the table like this:
CREATE TABLE STG_SCHEMA.tab1 (.....);
Now I created the table like this:
CREATE TABLE STG_SCHEMA.tab1 (.....) ORDER BY id;
By adding "order by id", I got rid of the default projection sort order, but I still get segmentation by hash.
The projection looks like this now (it was created automatically when I created the table). How can I get rid of that "SEGMENTED BY hash(" clause?
Note that the projection type is now P (it looks like the ORDER BY clause changed that).
Thanks
Raul
@raulk89 -
Fyi...
P - CREATE TABLE WITH PROJ CLAUSE
https://www.vertica.com/blog/vertica-quick-tip-projection-create-types/
A projection on a columnar table is either segmented or unsegmented (replicated).
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/ConceptsGuide/Components/ProjectionSegmentation.htm
Is just ID a good candidate key to segment by?
https://www.vertica.com/blog/determining-candidate-segmentation-keys-quick-tip/
I never said I need to segment by the ID field.
I understand that I probably need to have my tables UNSEGMENTED, since I have a single node. Am I right here?
I do not see a way to specify the "SEGMENTED BY" clause when creating the schema.
How can I create a table without segmentation?
Thanks
Raul
The projection definition would need to specify "UNSEGMENTED ALL NODES KSAFE 1" instead of the current "SEGMENTED BY ..." syntax.
I don't know if it will change the performance on a single node cluster, however.
So I need to manually create a projection to get this UNSEGMENTED projection?
So first the create table statement:
CREATE TABLE STG_SCHEMA.tab1 (.....);
and then a create projection statement?
Am I correct?
Thanks
Raul
You can specify those directives in your CREATE TABLE statement.
CREATE TABLE STG_SCHEMA.tab1 (...)
ORDER BY tab1.id
UNSEGMENTED ALL NODES;
The K-Safety of a single node cluster is 0.
Per @Vertica_Curtis's suggestion, here is an example:
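Something like this (column list abbreviated the same way as in the earlier posts):
CREATE TABLE STG_SCHEMA.tab1 (.....)
ORDER BY id
UNSEGMENTED ALL NODES
KSAFE 0;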
Ok, thanks.
But before, the auto-created projection was named tab1_super, and when I run EXPLAIN it shows that it still accesses tab1_super.
Does that mean I actually have 2 projections created (named tab1 and tab1_super) when I specify "order by id unsegmented all nodes ksafe 0"?
Raul
Hi Raul,
Did you run vioperf on your node? It would be good to understand the speed of your disks.
To sum up:
1. Check your disks to see what can be expected from them.
2. Check how many projections you have per table.
3. Create an unsegmented projection with just a few columns in the ORDER BY.
4. Reduce the length of the 65K varchar columns to the real lengths.
5. Split your file into several parts and run concurrent COPYs - here you should experiment to find the best concurrency.
Hope this helps.
But what about my last question regarding projections?
I formatted it as code. It is not very good visually, but if you copy it into Notepad++, for example, it is much easier to understand.
Thanks!
Here is your limitation: 476 MB/s overall reading and writing in parallel. Just keep it in mind.
You have CPU scaling on:
* This machine's high load time: 122 microseconds.
* This machine's low load time: 582 microseconds.
The low load time should not be much higher than the high load time. If you want to achieve better performance, work on that.
Re projections, please send the output of this: SELECT EXPORT_OBJECTS('/home/dbadmin/<outfile.sql>','<your_table_name>',FALSE);
After you create a new projection, run SELECT REFRESH('<your_table_name>'); to populate it with the data.
After that, run DROP PROJECTION ... ; for the one you no longer need.
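For example, a minimal sketch (tab1_unseg is just a placeholder name for the new projection; tab1_super is the auto-created one you mentioned earlier):
CREATE PROJECTION STG_SCHEMA.tab1_unseg AS
  SELECT * FROM STG_SCHEMA.tab1
  ORDER BY id
  UNSEGMENTED ALL NODES;
SELECT REFRESH('STG_SCHEMA.tab1');        -- populate the new projection
DROP PROJECTION STG_SCHEMA.tab1_super;    -- drop the old one
If the DROP complains about history before the AHM, running SELECT MAKE_AHM_NOW(); first usually helps.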