Vertica new cluster running slow

We just got a new cluster that was supposed to be much faster than the old one, but it is much slower.
I have checked a few system tables and did not find anything.
I noticed that even a "COPY" command or "INSERT /*+ direct */ INTO tab_a (SELECT * FROM tab_b)" is slow.
Any hints on this? Much appreciated!

Best Answers

  • mosheg Employee
    edited May 19 Accepted Answer

    A spill can happen in both environments; however, on a system with more memory it can take longer to finish the first plan of the query before it spills.
    It is advisable to add a spill hint to the specific query or, better, to tune the query or JOIN so that it does not exceed the query's memory budget.
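
    For example, a session-level way to force the join spill on the first attempt is sketched below. This is only a sketch: verify the option name against your Vertica version before relying on it.

    -- Force joins to spill on the first plan for this session
    -- (verify this option name for your Vertica version):
    SELECT ADD_VERTICA_OPTIONS('EE', 'ENABLE_JOIN_SPILL');

    -- Revert to the default behavior afterwards:
    SELECT CLEAR_VERTICA_OPTIONS('EE', 'ENABLE_JOIN_SPILL');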

Answers

  • sahil_kumar Employee
    edited May 19

    Were there any other software or hardware upgrades, or any changes in the ETL or reporting queries, etc.?

    • Check whether any system resource (memory, CPU, network, or disk I/O) is saturated.
      select * from memory_usage order by average_memory_usage_percent desc ;
      select * from cpu_usage order by average_cpu_usage_percent desc ;
      select * from network_usage order by (rx_kbytes_per_second+tx_kbytes_per_second) desc ;
      select * from io_usage order by (read_kbytes_per_second+written_kbytes_per_second) desc ;
  • Thanks, Kumar.
    Memory and CPU are at 20%, so they should be fine.
    network_usage top node: around 74M/sec (tx) and 84M/sec (rx)
    io_usage top node: around 30KB/sec read and 169M/sec written
    Do these metrics look normal?

  • Another fact: the new nodes are on 9.x and the old nodes on 8.x. Are there any configuration settings that would differ between the two versions?
    I suspect some configuration settings may not be correct. I have been a fan of Vertica: fast, powerful, and easy to use.
    I really appreciate your feedback.

  • mosheg Employee

    If your new cluster nodes have more memory, then in a case where Vertica picks a very large fact table on the inner side of a join
    and a forced spill does not happen, the first attempt can take much longer: there is more memory to fill before the join spills and the query is replanned.
    Try to force the join spill to occur on the first plan of the query, thus avoiding a replan.

    -- Here's a query that will find all spill events in the last 24 hours:
    SELECT 
        time, 
        transaction_id, 
        statement_id, 
        request_id, 
        event_type, 
        event_description 
    FROM dc_execution_engine_events 
    WHERE (NOW() - time) < '24 hours' 
        AND event_type ILIKE '%SPILL%' 
    ORDER BY time DESC;
    
                 time              |  transaction_id   | statement_id | request_id |     event_type      |                           event_description             
    -------------------------------+-------------------+--------------+------------+---------------------+-----------------------------------------------------------------------
     2020-05-19 11:28:27.776255-04 | 45035996273713491 |            1 |          0 | GROUP_BY_SPILLED    | GROUP BY key set did not fit in memory, using external sort grouping.
     2020-05-19 11:28:27.775592-04 | 45035996273713491 |            1 |          0 | GROUP_BY_SPILLED    | GROUP BY key set did not fit in memory, using external sort grouping.
     2020-05-19 11:28:27.77525-04  | 45035996273713491 |            1 |          0 | GROUP_BY_SPILLED    | GROUP BY key set did not fit in memory, using external sort grouping.
     2020-05-19 10:27:16.202986-04 | 45035996273713284 |            1 |          1 | SYSQUERY_POOL_SPILL | Sysquery Memory Full; spilled to temp ROS files
     2020-05-19 10:26:51.825093-04 | 45035996273713282 |            1 |          1 | SYSQUERY_POOL_SPILL | Sysquery Memory Full; spilled to temp ROS files
    (5 rows)
    
  • > @mosheg said:
    > Try to force the join spill to occur on the first plan of the query, thus avoiding a replan.

    Yes, the new cluster has much more memory than the old one.
    I see a lot of records from this query (16,530 rows). How do I resolve these kinds of issues?

    2020-05-19 09:23:56 49539595903924528 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 09:21:34 49539595903923862 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 09:13:45 49539595903920838 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 09:13:42 49539595903920838 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.

  • I found out how to force a join spill. Do you think we should do it at the SQL statement level or the database level?

  • davds
    edited May 19

    I will try this out

  • Thanks, it makes sense now and I will try it.
    But the slow performance also occurs on SQL that copies data to a stage table, like:
    select distinct column_a, column_b, column_c from tab;
    In this case, no join is involved.
    Any thoughts?

    thanks
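
    (A possible avenue here, sketched with the example table and columns above: a DISTINCT compiles to a GROUP BY, and Vertica's EXPLAIN output shows whether it runs as GROUPBY PIPELINED or as GROUPBY HASH, which can spill. The projection below is hypothetical.)

    -- If the plan shows GROUPBY HASH (may spill) rather than GROUPBY PIPELINED,
    -- a projection sorted on the DISTINCT columns can enable the pipelined form:
    EXPLAIN select distinct column_a, column_b, column_c from tab;

    -- Hypothetical projection sorted to enable a pipelined group-by:
    CREATE PROJECTION tab_abc AS
        SELECT column_a, column_b, column_c FROM tab
        ORDER BY column_a, column_b, column_c;
    SELECT REFRESH('tab');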

  • skamat Administrator

    You could run the Linux tool "perf top" to identify the functions where most of the time is spent; it may give you some hints. I would recommend opening a support ticket and requesting that a support engineer get on a call with you to debug this.

  • Is there any way to check the configuration settings? I was thinking maybe some parameters were not set up properly.

  • sahil_kumar Employee

    select * from vs_configuration_parameters where current_value <> default_value;
    For any changed configuration parameters, check the reason the default value was modified. If there is no valid reason, change it back to the default.
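
    If a parameter does need to be changed back, a sketch of doing so at the database level follows; MaxClientSessions is only a stand-in parameter name, and the exact syntax varies by Vertica version:

    -- Example only; MaxClientSessions is a stand-in parameter name:
    ALTER DATABASE DEFAULT SET PARAMETER MaxClientSessions = 50;
    -- Reset it back to its default:
    ALTER DATABASE DEFAULT CLEAR PARAMETER MaxClientSessions;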

  • Thanks, Kumar, this is a very handy SQL.
    All the parameters are at their default values. We found that even simple SQL like select distinct a,b,c from table is slower than on the old cluster.
    The new cluster has 32 nodes and the old one only 22; the new cluster also has better hardware.
    Anything else we can check quickly? CPU and memory are only at 20%.

  • Bryan_H Employee

    Hi, one other check is resource pool status:
    select * from resource_pool_status;
    It's possible that a user-defined resource pool, or even the TM pool, has old settings for memory, number of threads, etc., so check resource_pool_status and compare it with the resource pool settings. If memory is always at its maximum, or fewer threads are in use than the number of cores, you may want to adjust the memory and concurrency settings to match the new nodes' CPU count and memory.
    Also, please check the sysctl settings vm.dirty_ratio and vm.dirty_background_ratio, since these may need tuning if there is more memory in the new boxes: https://www.vertica.com/kb/Tuning-Linux-Dirty-Data-Parameters-for-Vertica/Content/BestPractices/Tuning-Linux-Dirty-Data-Parameters-for-Vertica.htm
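
    A sketch of aligning the general pool's concurrency with the new hardware; the numbers below are placeholders rather than recommendations, so derive yours from the actual per-node core count:

    -- Placeholders only: match PLANNEDCONCURRENCY to the per-node core count
    ALTER RESOURCE POOL general PLANNEDCONCURRENCY 28;
    ALTER RESOURCE POOL general MAXCONCURRENCY 32;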

  • Thanks. I checked resource_pool_status; we do not have a user-defined resource pool, and memory usage looks OK.
    This new cluster is brand new; we did not upgrade from the old one, and it has more nodes and more powerful hardware,
    so it should be faster... Is it possible that more nodes cause more data transmission?
    select * from table comes back quickly, but select distinct a,b,c from table is slow. Because a sort is involved?
    Any thoughts?
    Thanks

  • sahil_kumar Employee

    • Run "select 1" and other small queries to test system performance and see where the transaction spends the most time.

    \timing
    profile select 1;
    select node_name, transaction_id, statement_id, execution_step, (completion_time - time) as duration from v_internal.dc_query_executions where transaction_id = <transaction_id>;

    • Run analyze_workload() and review the tuning recommendations for anything actionable.

    select analyze_workload('');
    select time, node_name, observation_description, tuning_type, tuning_parameter, tuning_description from dc_tuning_recommendations order by time desc;

    • Check whether any node is consistently slower than the others in the cluster by timing a small query like select 1:

    \timing
    select 1;
    select
        last_stmt.time as rollback_time,
        txn_end.node_name as node_name,
        txn_end.time - last_stmt.time as total_delay,
        txn_end.end_begin_time - last_stmt.time as invoke_delay,
        txn_end.time - txn_end.end_begin_time as response_delay
    from
        v_internal.dc_transaction_ends as txn_end,
        (select transaction_id, time from v_internal.dc_requests_issued
         where node_name = local_node_name() and session_id = current_session() and transaction_id != current_trans_id()
         order by time desc
         limit 1) as last_stmt
    where last_stmt.transaction_id = txn_end.transaction_id
    order by total_delay desc;

  • Thanks, I have followed all the steps. I do see that the tuning recommendations asked me to run the database designer. I ran the incremental designer and applied the new projections, but it does not seem to help.
    I also used your SQL to check for slow nodes, but no node has a noticeably slow response time.
    As I mentioned before, it becomes slow when I do a distinct:
    select distinct a,b,c from table;
    Is it possible that the initiator node's sort/merge is slow? How can I check that?

  • LenoyJ Employee

    > This new cluster is brand new; we did not upgrade from the old one, and there are more nodes and more powerful hardware.

    When you say 'more powerful hardware', has your total core count gone up?

  • Yes, more nodes with better hardware (28-core CPUs and 768 GB of memory). The old system is much smaller than the new one.

  • The new cluster has 32 nodes vs. 22 on the old cluster. How can I check network performance between the nodes? Does that sound like a valid reason?

  • DaveT Employee

    I recommend taking a step back from the database and running the v*perf utilities (vcpuperf, vioperf, vnetperf) if you have not already done so. Make sure your environment is performant before you investigate the database.

  • mosheg Employee

    1) Many health-check indications are covered in the Vertica Advisor report. As @skamat suggested, it is advisable to open a support case, create and upload a scrutinize collection to Vertica's S3, and receive the Advisor report.
    2) Verify that all projections are distributed evenly across all cluster nodes:
    select node_name, projection_name, sum(used_bytes)/1024^3 as GB from projection_storage group by 1,2 order by 3 desc;
    3) In addition, compare the output of the following query with the Warning and Critical event types listed here:
    https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/MONITOR/QUERY_EVENTS.htm

    select event_type,event_description, count(*)
    from query_events
    where ( NOW() - EVENT_TIMESTAMP) < '48 hours'
    group by 1,2
    order by 3 desc;
    
