Performance issue: high load, low cpu
Occasionally one of our nodes shows a very high load average (>30) while the other nodes look fine. CPU utilization on all nodes then drops to almost 0%, and our application latency increases sharply.
Each node has 60GB of memory and 16 vCPUs and runs Vertica 9.0.1-3 on CentOS 7 (kernel 4.15.6-1.el7.elrepo.x86_64).
The system is not swapping and has ~30GB of free memory. Disk is active but not highly loaded (although there is much more read activity than on the other nodes). Network packets in/out and KB in/out look about normal.
Any ideas? What should I be looking at?
Comments
Hi,
The points below may help. First, check how requests are distributed across the nodes:
SELECT a.node_name,
       a.requests,
       ROUND((a.requests / b.total_requests) * 100, 2) AS percent
FROM (SELECT node_name,
             COUNT(*) AS requests
      FROM v_monitor.query_requests
      GROUP BY node_name) a
CROSS JOIN (SELECT COUNT(*) AS total_requests
            FROM v_monitor.query_requests) b
ORDER BY percent DESC;
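For anyone who exports the per-node counts and wants to check the arithmetic outside the database, here is a minimal Python sketch of the same percentage calculation. The function name `request_share` and the node names in the usage note are made up for illustration; they are not part of Vertica.

```python
def request_share(counts):
    """Map node name -> percent of all requests, rounded to 2 places, highest first."""
    total = sum(counts.values())
    if total == 0:
        return {}  # no requests recorded, so there is nothing to compare
    shares = {node: round(100.0 * n / total, 2) for node, n in counts.items()}
    # Sort by share, largest first, mirroring ORDER BY percent DESC in the query.
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))
```

Calling `request_share({"node_a": 20, "node_b": 50, "node_c": 30})` puts `node_b` first with a share of 50.0. A node whose share is far above the others would be the one to look at.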
If the load is not balanced, check the load balancing policy and set it to "ROUNDROBIN" if it is not already set.
If the high CPU utilization comes only from the Vertica process, the problem may lie in the queries running at that moment.
top -c (on the host of the high-load node)
Check CPU utilization at the Vertica level with the query below.
select * from cpu_usage order by start_time desc;
Thanks, will take a look.
I've looked at a number of these stats but still can't figure it out. The issue is that there is high load, but cpu utilization is not high.
When the system gets into this state, very little data is processed on any node in the cluster. The entire cluster comes to a halt. We've had to put in a script to detect high load and kill/restart the node. This week it has happened 3 times... It never happened prior to moving to 9.0.1-3.
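One thing worth capturing before the restart: on Linux the load average counts threads in uninterruptible sleep (state D) as well as runnable ones, so high load with idle CPUs often means threads are blocked in the kernel rather than computing. Below is a minimal sketch, in Python, of a check that could run from such a detection script. It is not the script we actually use; the function names and the threshold of 30 (borrowed from the ">30" above) are assumptions, and it only reads the standard /proc/loadavg and /proc/<pid>/task/<tid>/stat files.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_stat(stat_line):
    """Return (id, comm, state) from one line of a /proc stat file.

    The command name sits in parentheses and may contain spaces or
    parentheses itself, so the state is taken after the last ')'.
    """
    head, _, tail = stat_line.rpartition(")")
    task_id, _, comm = head.partition(" (")
    return int(task_id), comm, tail.split()[0]


def blocked_threads(proc_root="/proc"):
    """List (tid, comm) for every thread in uninterruptible sleep (state D)."""
    blocked = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    task_id, comm, state = parse_stat(f.read())
            except (OSError, ValueError, IndexError):
                continue  # the thread exited or the line was unreadable
            if state == "D":
                blocked.append((task_id, comm))
    return blocked


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        load1, _, _ = parse_loadavg(f.read())
    LOAD_THRESHOLD = 30.0  # assumed cutoff, taken from the ">30" in the question
    if load1 > LOAD_THRESHOLD:
        for tid, comm in blocked_threads():
            print(f"blocked tid={tid} comm={comm}")
```

If the list is dominated by Vertica threads in state D during an incident, that would suggest looking at what they are waiting on (storage, a filesystem, or a kernel lock) rather than at query CPU usage.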
I've checked:
I checked load balancing and it looks fine. We have it enabled both on the server and in the JDBC clients.
I looked at query_requests and don't see a large imbalance.
There is literally nothing else running on the machine (total cpu is around 30%).
I've looked at the number of open sessions, and it isn't higher than normal.
The number of connections to the machine (netstat) is low too.
The total memory used (RSS) is normal, not more than 50% of all memory.
There is very little disk I/O.