Performance issue: high load, low cpu

We sometimes get into a situation where one of the nodes experiences a high load average (>30) while the other nodes look fine. When this happens, our app latency increases sharply and CPU utilization on all nodes drops to almost 0%.

We are running each node with 60GB memory and 16 vcpus with Vertica 9.0.1-3 on CentOS 7 (4.15.6-1.el7.elrepo.x86_64).

The system is not swapping, and there is ~30GB of free memory. Disk is active but not highly loaded (although there is much more read activity than on the other nodes). Network packets in/out and KB in/out seem about normal.

Any ideas? What should I be looking at?

Comments

  • krishnamraju Community Edition User

    Hi,

    The points below may help:

    1. That single node may be processing more data than the other nodes.
    2. Check load balancing and the number of requests per node:
      SELECT a.node_name,
             a.requests,
             ROUND((a.requests / b.total_requests) * 100, 2) AS percent
      FROM (SELECT node_name,
                   COUNT(*) AS requests
            FROM v_monitor.query_requests
            GROUP BY node_name) a
      CROSS JOIN (SELECT COUNT(*) AS total_requests
                  FROM v_monitor.query_requests) b
      ORDER BY percent DESC;

    If the load is not balanced, check the load-balancing status and set the policy to "ROUNDROBIN" if it is not already set.
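    As a minimal sketch of that check, assuming native connection load balancing is what is meant here, the current policy can be read from the catalog and then changed with the built-in function. Both statements need to be run by a superuser.

    ```sql
    -- Show the current native connection load-balancing policy.
    SELECT load_balance_policy FROM v_catalog.databases;

    -- Switch to round-robin if the policy above is NONE or something else.
    SELECT SET_LOAD_BALANCE_POLICY('ROUNDROBIN');
    ```

    Note that clients also have to opt in (for example through the JDBC ConnectionLoadBalance property) before the server-side policy has any effect.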

    3. Check whether any other applications are running on the host while the load is high.
      If the high CPU utilization comes only from the Vertica process, the issue may be in the queries running at that moment.

    top -c    (run on the high-load node's host)

    Check the CPU utilization at the Vertica level with the query below:
    select * from cpu_usage order by start_time desc;

    4. Try to identify the queries that are consuming the most CPU cycles from v_monitor.execution_engine_profiles.
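    One possible way to do this is sketched below. It assumes the 'execution time (us)' counter is the right proxy for CPU cost and that the statements of interest are still present in the table (it only holds profiled, running, or recently run queries). The LIMIT of 20 is arbitrary.

    ```sql
    -- Sum the per-operator execution time for each statement on each node,
    -- then list the most expensive statements first.
    SELECT node_name,
           session_id,
           transaction_id,
           statement_id,
           SUM(counter_value) AS total_execution_us
    FROM v_monitor.execution_engine_profiles
    WHERE counter_name = 'execution time (us)'
    GROUP BY node_name, session_id, transaction_id, statement_id
    ORDER BY total_execution_us DESC
    LIMIT 20;
    ```

    The transaction_id and statement_id values can then be matched against v_monitor.query_requests to see the text of each request.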
  • Thanks, will take a look.

  • I've looked at a number of these stats but still can't figure it out. The issue is that there is high load, but cpu utilization is not high.

    When the system gets into this state, very little data is processed on any node in the cluster. The entire cluster comes to a halt. We've had to put in a script to detect high load and kill/restart the node. This week it has happened 3 times... It never happened prior to moving to 9.0.1-3.

    I've checked:

    • I checked load balancing and it looks fine. We have it enabled both on the server and in the JDBC clients.

    • I looked at query_requests and don't see a large imbalance.

    • There is literally nothing else running on the machine (total cpu is around 30%).

    • I've looked at the number of open sessions, and it isn't higher than normal.

    • The number of connections to the machine (netstat) is low too.

    • The total memory used (RSS) is normal, not more than 50% of all memory.

    • There is very little disk IO.
