Direct copy slower in Vertica in Version 10
benslag
Vertica Customer ✭
Hi everybody,
We are migrating our Vertica database to Azure and at the same time we're upgrading from version 9.2 to 10.
For our analytic database we bulk load all of our tables daily to Vertica using Pentaho Data Integration. This goes quite well. Only in Version 10 the wide and large table take up far more time to load than in version 9.2.
One of those tables contains 250 million records and has 141 columns.
I've played with the number of maxconcurrency for TM (Tuple Mover) in the resource pool but to no avail.
Does anyone have an idea of where to look to improve the performance of the direct copy using ROS containers?
0
Answers
Try the relevant recommendation which best fit to your scenario.
If you are loading a table with many columns including long varchars change the LoadMergeChunkSizeK config parameter.
It turns out to have impact on the copy/load performance during the sorting phase.
Change the LoadMergeChunkSizeK parameter as an exception for specific wide tables load.
Default value (2048) may be too small for tables with a large # of columns and lots of varchar columns.
Do one test with 20480 SELECT SET_CONFIG_PARAMETER('LoadMergeChunkSizeK',20480);
And another with 204800 SELECT SET_CONFIG_PARAMETER('LoadMergeChunkSizeK',204800);
One of the reasons for a slow load is the amount of columns In the projection ORDER BY clause.
Recreate your schema with NO “SEGMENTED BY” clause and with only one INT field in the ORDER BY clause.
Because by default Vertica creates a projection with many fields in the sort order.
Measure the time it takes to run 2 separated COPY commands in parallel, each load different files.
Loading wide tables into flex tables requires loading one field instead of many fields.
Thus, it reduces the size of the catalog and improves overall database performance.
The initial load is fast, however, query performance is lower in comparison with columnar storage.
Use GROUPED correlated columns to load very wide tables.
If you have data that is always accessed together and it is not used in predicates,
you can increase query performance by grouping these columns.
Once grouped, queries can no longer independently retrieve from disk all records for an individual column independent of the other columns within the group.
You can: Group some of the columns: (a, GROUPED(b, c), d)
Group all of the columns: (GROUPED(a, b, c, d))
Create multiple groupings in the same projection: (GROUPED(a, b), GROUPED(c, d))
For example: CREATE PROJECTION ProjectionName (Field1 ENCODING RLE, GROUPED(Field2 ENCODING DELTAVAL, Field3))
AS (SELECT * FROM TableName) KSAFE 1;
Measure the time it takes to load files locally from the same node disks the COPY run from.
And compare the same load when the load files are located on S3.
“ON ANY NODE” is the default option for the COPY command when loading from HDFS or S3 paths and does not need to be specified.
However, the following conditions must be met to ensure parallel load:
A. The source files to load should be available on all nodes, so COPY opens the file and parses it from any node in the cluster.
B. The file must be the same on all nodes. If the file differs on two nodes, an incorrect or incomplete result is returned, with no error or warning.
C. Vertica attempts to apportion the load among several nodes if the file is large enough to benefit from apportioning.
D. If ERROR TOLERANCE or REJECTMAX is specified, Vertica chooses to load from a single node.
E. Use a wildcard (such as *.dat) to load multiple input files, combined with the ON ANY NODE clause.
F. “ON ANY NODE” is invalid with STDIN and LOCAL: STDIN can only use the initiator node, and LOCAL indicates a client node.
Do you load many empty values with wide varchars?
If yes, consider to remove all whitespace (denoted by [:space:] in tr)
cat file_name | tr -d '[:space:]'
Monitor the load process in MC and with queries like:
select * from dc_load_events where transaction_id=XYZ and statement_id=XYZ order by time;
To get the transaction_id and statement_id run first
select * from load_streams where is_executing;
Use the STREAM NAME parameter as part of the COPY statement so it will be easier to identify it in the LOAD_STREAMS system table.
Check how your schema is defined:
SELECT EXPORT_OBJECTS('','SCHEMANAME.TABLENAME',true);
If nothing from the above helps, consider to add more nodes to the cluster.
See also this thread:
https://forum.vertica.com/discussion/comment/245484
Hi Mosheg,
Thank you for your extensive answer concerning the slow direct copy.
The fist thing I tried was changing the LoadMergeChunkSizeK value. I assume that I don't have to restart the database in order to have an effect on the performance. In any case, this did not increase the performance by a lot. It was barely noticeable.
The next suggestion you describe is following:
One of the reasons for a slow load is the amount of columns In the projection ORDER BY clause.
Recreate your schema with NO “SEGMENTED BY” clause and with only one INT field in the ORDER BY clause.
Because by default Vertica creates a projection with many fields in the sort order.
First off: I'm not a real dba so I'm trying to understand what you're saying by recreating my schema.
Let me first explain a little bit about our way of working.
We build our datawarehouse using delta's in Mysql because a transactional database is updatet faster than Vertica.
After finishing in Mysql we bulkload the entire Mysql schema to vertica, table after table to a temporary schema called 'voorportaal'.
We do that by creating every table using ddl statements translated from Mysql to Vertica ddl an loading the data using Pentaho Vertica bulkloader which uses direct copy.
After finishing that process we move all tables from dm schema to backup schema and after that from 'voorportaal' to dm schema. We do that so that our users will hardly notice the interuption.
What would for us be the way to recreate (I assume) voorportaal schema withe the NO "SEGMENTED BY" clause?
kind regards,
Ben Slagmolen
From your explanation I understand you do not create any projection and use the default ones.
Auto-projections called superprojections are created automatically by Vertica for any new table.
Vertica uses the the MaxAutoSortColumns configuration parameter to determine how many columns are included in the auto-projection sort order. (Default is 8 and zero will do sort on all columns)
Reducing MaxAutoSortColumns to a smaller number will reduce the need to sort many columns at load time.
For example: ALTER DATABASE YOUR_DB_NAME SET MaxAutoSortColumns = 2;
For this parameter to take effect you will need to recreate the relevant table you try to load faster.