Performance degradation with increased data
Hi Team,
I am seeing performance degradation as data size increases in Vertica.
Is this a known factor?
For 2 days of data the ETL takes about 1 hour, but for 10 days of data it takes almost 8 hours.
I have done most of the common tuning steps:
Updating statistics
Purging of delete vectors
Unsegmented dimensions
Segmented fact
Merge join
Encoding
Sorting
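As a sketch, the statistics and delete-vector steps above look like the following (the table name is illustrative; substitute your own schema and table):

```sql
-- Refresh optimizer statistics after each incremental load
SELECT ANALYZE_STATISTICS('public.fact_sales');

-- Purge delete vectors left behind by UPDATE/DELETE activity
SELECT PURGE_TABLE('public.fact_sales');

-- Check how many deleted rows still remain per projection
SELECT projection_name, SUM(deleted_row_count) AS deleted_rows
FROM v_monitor.delete_vectors
GROUP BY projection_name
ORDER BY deleted_rows DESC;
```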
It's happening on both a single-node and a 3-node cluster.
Thanks
Deb
Can you please share your thoughts on the best tuning methods?
You'll need to do more analysis.
- You are comparing ETL on 2 days of data vs 10 days of data. Do you mean transforming and loading the entire 2 days of data / 10 days of data? Or do you mean an incremental load on top of pre-existing 2 days or 10 days of data?
- Are there one or two steps that are contributing to the slowdown more than other steps? If yes, compare the execution_engine_profiles for those steps.
- It could be that some step spills to disk more as the amount of data increases.
- Are there query retries that might be a factor?
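As a starting point for finding the steps that slowed down, you can rank recent statements by elapsed time in the query_requests system table (a minimal sketch; adjust the filter and limit to your workload):

```sql
-- Rank recent statements by elapsed time to spot the slow ETL steps
SELECT start_timestamp,
       request_duration_ms,
       LEFT(request, 80) AS request_snippet
FROM v_monitor.query_requests
WHERE request_type = 'QUERY'
ORDER BY request_duration_ms DESC
LIMIT 20;
```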
--Sharon
Hi Sharon,
Thanks for the attention to the issue.
- I am talking about the incremental load only. Say we started the ETL 2 days back; after 8 days, the time taken by the SQLs as reported in the 'query_requests' system table is drastically increasing. We are not using any SELECT * from a table; we select only the columns required.
- We are running the ETL on an hourly basis: source (CSV) >> stage >> dim/fact >> aggregate
- A few updates were taking time and I tuned them with ANALYZE_STATISTICS and purging of deleted records.
Before they took 5 mins; now 2 mins.
- "It could be that some step spills to disk more as the amount of data increases."
How can I check this?
- There are no query retries, as we have observed no node failures or network issues.
Please suggest some vital points to tune, e.g. resource pools, the Tuple Mover, or any other parameters you have in mind.
Please also share any performance tuning guide you have apart from the Vertica documentation.
Thanks,
Deb
Collect Execution Engine level profiling data over time so that you can compare where the time is being spent, comparing the faster times to the slower times. If you are using MC the profiling tool included there will give you some insights, otherwise query the execution_engine_profiles data directly. Key counters in this case would be "clock time (us)" and the "%temp%" counters. Also worth confirming that the explain plan remains constant over time. You can pull those from dc_explain_plans.
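As a sketch, assuming profiling data has been collected for the statement in question (the transaction_id and statement_id here are placeholders to be replaced with those of the slow query):

```sql
-- Per-operator clock time and temp-space counters for one statement
SELECT node_name, path_id, operator_name, counter_name,
       SUM(counter_value) AS total
FROM v_monitor.execution_engine_profiles
WHERE transaction_id = 45035996273705000   -- placeholder
  AND statement_id = 1                     -- placeholder
  AND (counter_name = 'clock time (us)' OR counter_name ILIKE '%temp%')
GROUP BY node_name, path_id, operator_name, counter_name
ORDER BY total DESC;
```

Comparing this output for a fast run versus a slow run of the same statement shows which operator (join, sort, etc.) is absorbing the extra time and whether temp usage is growing with it.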
Check the dc_requests_retried table to confirm that there aren't any query retries related to joins.
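A minimal check (Data Collector tables carry a `time` column; the other columns vary by version, so selecting everything is the safest sketch):

```sql
-- Look for any retried requests in the window of interest
SELECT *
FROM dc_requests_retried
ORDER BY time DESC
LIMIT 20;
```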
--Sharon