Vertica new cluster running slow
We just got a new cluster that was supposed to be much faster than the old one, but it is actually much slower.
I have checked a few system tables and did not find anything.
I noticed that even a COPY command or an "insert /*+ direct */ into tab_a (select * from tab_b)" statement is slow as well.
Any hints on this? Much appreciated!
Best Answers
mosheg Vertica Employee Administrator
The spill can happen in both environments; however, on a system with more memory it can take longer to get through the first plan of the query.
It is advised to add a spill hint to the specific query or, better, to tune the query or JOIN so that it does not exceed the query budget.
mosheg Vertica Employee Administrator
Try setting ENABLE_JOIN_SPILL at session level.
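For example, something along these lines; add_vertica_options is an internal switch, so please verify it against your exact version (or with support) before relying on it:
-- sketch only: internal option switches, confirm for your version before use
select add_vertica_options('EE', 'ENABLE_JOIN_SPILL');
-- check which options are active, and clear it again when done:
select show_current_vertica_options();
select clr_vertica_options('EE', 'ENABLE_JOIN_SPILL');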
To profile the query, use the qprof script provided by Maurizio Felici, attached here.
Run the attached qprof shell script this way:
./qprof-0.4b.sh -f query.sql -o qprof.out
Answers
Any other software or hardware upgrades? Any changes in the ETL or reporting queries, etc.?
select * from memory_usage order by average_memory_usage_percent desc ;
select * from cpu_usage order by average_cpu_usage_percent desc ;
select * from network_usage order by (rx_kbytes_per_second+tx_kbytes_per_second) desc ;
select * from io_usage order by (read_kbytes_per_second+written_kbytes_per_second) desc ;
Thanks, Kumar.
Memory and CPU are at about 20%, so they should be fine.
network_usage: the top node is around 74M/sec (tx) and 84M/sec (rx).
io_usage: the top node is around 30KB/sec read and 169M/sec written.
Do these metrics look normal?
Another fact: the new cluster is on version 9.x and the old one is on 8.x. Are there any configuration settings that differ between the two versions?
I suspect maybe some configuration settings are not correct. I have been a fan of Vertica: fast, powerful, and easy to use.
I really appreciate your feedback.
If your new cluster nodes have more memory, then in a case where Vertica picks a very large fact table on the inner side of the join
and a forced spill does not happen, the first try can take much longer, because there is more memory to fill before it spills and replans on the second try.
Try to force the join spill to occur on the first plan of the query, thus avoiding a replan.
> @mosheg said: If your new cluster nodes have more memory ...
Yes, the new cluster has much more memory than the old one.
I saw a lot of records for this query (16,530 rows). How do I resolve this kind of issue?
2020-05-19 09:23:56 49539595903924528 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
2020-05-19 09:21:34 49539595903923862 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
2020-05-19 09:13:45 49539595903920838 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
2020-05-19 09:13:42 49539595903920838 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
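For reference, this is roughly how to see which statements hit these events most often: a sketch against the QUERY_EVENTS system table (add a time filter as needed):
-- sketch: count spill-related events per statement to find the worst offenders
select transaction_id, statement_id, event_type, count(*) as event_count
from query_events
where event_type in ('GROUP_BY_SPILLED', 'JOIN_SPILLED')
group by 1, 2, 3
order by event_count desc;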
I found out how to force join spill. Do you think we need to do it at the SQL statement level or the database level?
I will try this out.
Thanks, it makes sense now and I will try it.
But the slow performance also shows up when copying data to a stage table with SQL like:
select distinct column_a, column_b, column_c from tab;
In this case, no join is involved.
any thoughts?
thanks
You could run the Linux tool "perf top" to identify the functions where most of the time is spent; it may give you some hints. I would also recommend opening a support ticket and requesting that a support engineer get on a call with you to debug this.
Is there any way to check the configuration settings? I was thinking maybe some parameters were not set up properly.
select * from vs_configuration_parameters where current_value <> default_value;
On configuration parameter changes, check the reason any parameter's default value was modified. If there is no valid reason, change it back to the default.
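If a parameter does need to go back to its default, the classic way is SET_CONFIG_PARAMETER with NULL; this is only a sketch with a placeholder parameter name, and newer versions also offer ALTER DATABASE for the same purpose, so double-check the docs for your version:
-- sketch: 'SomeParameter' is a placeholder, not a real parameter from this thread;
-- passing NULL is the traditional way to restore the default value
select set_config_parameter('SomeParameter', NULL);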
Thanks, Kumar, this is a very handy SQL.
All the parameters are at their default values. We found that even a simple SQL statement like "select distinct a,b,c from table" is slower than on the old cluster.
The new cluster has 32 nodes and the old one only has 22 nodes. The new cluster has better hardware as well.
Anything else we can check quickly? CPU and memory are only at 20%.
Hi, one other check is resource pool status:
select * from resource_pool_status;
It's possible that a user-defined resource pool, or even the TM pool, still has old settings for memory, number of threads, etc., so check resource_pool_status and compare it to the resource pool settings. If memory is always at max, or not enough threads are in use (fewer than the number of cores) in resource_pool_status, you might want to adjust the memory and concurrency settings to match the new nodes' CPU count and memory.
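For example, something like this puts the live numbers side by side (column names are from memory, so adjust to whatever your version of resource_pool_status exposes):
-- sketch: per-node snapshot of the general pool's memory and concurrency in use
select node_name, pool_name, memory_inuse_kb, max_memory_size_kb, running_query_count, planned_concurrency, max_concurrency
from resource_pool_status
where pool_name = 'general'
order by memory_inuse_kb desc;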
Also, please check the sysctl settings vm.dirty_ratio and vm.dirty_background_ratio, since these may need tuning if there is more memory in the new boxes: https://www.vertica.com/kb/Tuning-Linux-Dirty-Data-Parameters-for-Vertica/Content/BestPractices/Tuning-Linux-Dirty-Data-Parameters-for-Vertica.htm
Thanks. I checked resource_pool_status; we do not have a user-defined resource pool, and memory usage looks OK.
This new cluster is brand new; we did not upgrade from the old one. It has more nodes and more powerful hardware,
so it should be faster... Is it possible that more nodes cause more data to be transmitted between them?
I ran "select * from table" and it comes back quickly, but "select distinct a,b,c from table" is slow. Is that because a sort is involved?
Any thoughts?
Thanks
Run "select 1" and other small queries to test system performance and see where the transaction is spending more time.
\timing
profile select 1;
select node_name, transaction_id, statement_id, execution_step, (completion_time - time) as duration from v_internal.dc_query_executions where transaction_id = <transaction_id>;
select analyze_workload(''); and check the tuning_recommendations table
select time,node_name,observation_description,tuning_type,tuning_parameter,tuning_description from dc_tuning_recommendations order by time desc;
select 1;
\timing
select
last_stmt.time as rollback_time,
txn_end.node_name as node_name,
txn_end.time - last_stmt.time as total_delay,
txn_end.end_begin_time - last_stmt.time as invoke_delay,
txn_end.time - txn_end.end_begin_time as response_delay
from
v_internal.dc_transaction_ends as txn_end,
(select transaction_id, time from v_internal.dc_requests_issued
where node_name = local_node_name() and session_id = current_session() and transaction_id != current_trans_id()
order by time desc
limit 1) as last_stmt
where last_stmt.transaction_id = txn_end.transaction_id
order by total_delay desc;
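To see which operator or node a specific slow statement spends its time in, PROFILE plus EXECUTION_ENGINE_PROFILES can also help; this is a sketch, with the transaction_id and statement_id to be filled in from the PROFILE output:
-- sketch: after "profile select distinct a,b,c from table;" plug in the
-- transaction_id/statement_id it reports to see where the time went
select node_name, path_id, operator_name, sum(counter_value) as exec_time_us
from execution_engine_profiles
where counter_name = 'execution time (us)'
and transaction_id = <transaction_id>
and statement_id = <statement_id>
group by 1, 2, 3
order by exec_time_us desc
limit 20;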
Thanks, I have followed all the steps. I do see a tuning recommendation asking to run the database designer. I have run the incremental designer and applied the new projections, but it does not seem to help.
I also used your SQL to check for slow nodes, but no node has a noticeably slow response time.
As I mentioned before, it becomes slow when I do a DISTINCT:
select distinct a,b,c from table;
Is it possible that the node doing the sort/merge is slow? How can I check that?
When you say 'powerful hardware', has your total core count gone up?
Yes... more nodes with better hardware (28-core CPUs and 768 GB of memory). The old system is much smaller than the new one.
The new cluster has 32 nodes vs. 22 nodes in the old cluster. How can I check the network performance between nodes? Does that sound like a valid reason?
I recommend taking a step back from the database and running the v*perf utilities if you have not already done that. Make sure that your environment is performant before you investigate the database.
1) Many health-check indications are covered in the Vertica Advisor report. As @skamat suggested, it is advised to open a support case, create and upload a scrutinize file to Vertica S3, and receive the Advisor report.
2) Verify all projections are distributed evenly between all cluster nodes:
select node_name, projection_name, sum(used_bytes)/1024^3 as GB from projection_storage group by 1, 2 order by 2, 3 desc;
3) In addition, compare the output of a query against QUERY_EVENTS (a sketch follows the link below) with the Warning and Critical event types listed here:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/MONITOR/QUERY_EVENTS.htm
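The query itself is not shown in the thread; a minimal sketch of the kind of QUERY_EVENTS check meant here:
-- sketch: summarize recent query events to compare against the warning/critical
-- event types in the linked documentation
select event_type, count(*) as occurrences
from query_events
group by event_type
order by occurrences desc;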