Performance degrade with increased data

deb0687 · October 2015

Hi Team,

I am seeing performance degradation if data size is increasing in vertica.

Is this a know factor ?

For 2 days data the ETL is taking 1 hr but for 10 days data it's almost 8 hr.

I have done most of the tuning like
Updating statistics
Purging of delete vector
Unsegmented dimensions
Segmented fact
Merge join
Encoding
Sorting

It's happening for both single node and 3 node cluster

Thanks
Deb

deb0687 · October 2015

Hi Team,

Can you please share your thoughts for best tuning methods

Sharon_Cutter · October 2015

You'll need to do more analysis.

- You are comparing ETL on 2 days of data vs 10 days of data. Do you mean transforming and loading the entire 2 days of data / 10 days of data? Or do you mean an incremental load on top of pre-existing 2 days or 10 days of data?

- Are there one or two steps that are contributing to the slowdown more than other steps? If yes, compare the execution_engine_profiles for those steps.

- It could be that some step spills to disk more as the amount of data increases.

- Are there query retries that might be a factor?

--Sharon

deb0687 · October 2015

Hi Sharon,

Thanks for the attention to the issue.

-I am taking about the incremental load only. Means says we started the ETL 2 days back and after 8 days the time taken by the sqls for 'query_request' system table is drastically increasing. We are not using any select * from a table but we are taking appropriate columns required.

- We are running ETL hourly basis source (CSV) >> stage >> dim/fact >> aggregate

- Few updates are taking time and I tuned them by analyze_stat and purging of deleted records

Before it was taking 5 mins , now 2 mins.

- It could be that some step spills to disk more as the amount of data increases.

How I can check this ?

- There is no query retries as there is no node fail/network issue we ovserved.

Please suggest some vital points to tune like eg. resourcepool , tuplemover or any other parameter in yr mind.

Please share if you have any performance tuning guide apart from Vertica documents

Thanks,

Deb

Sharon_Cutter · November 2015

Collect Execution Engine level profiling data over time so that you can compare where the time is being spent, comparing the faster times to the slower times. If you are using MC the profiling tool included there will give you some insights, otherwise query the execution_engine_profiles data directly. Key counters in this case would be "clock time (us)" and the "%temp%" counters. Also worth confirming that the explain plan remains constant over time. You can pull those from dc_explain_plans.

Check the dc_requests_retried table to confirm that there aren't any query retries related to joins.

--Sharon

We're Moving!

Create My New Community Account Now

Performance degrade with increased data

Comments

Leave a Comment