High CPU on new node

Hi all,

Recently we added a 4th node to our 3-node cluster (Vertica 7.2.0-1 on AWS r3.2xlarge instances).

Adding the node was pretty painless. The only issue is that the new node seems to use about twice as much CPU as the other three nodes:

select node_name, trunc(AVG(average_cpu_usage_percent), 1) from cpu_usage WHERE start_time BETWEEN NOW() - INTERVAL '15 minutes' AND NOW() group by node_name;

v_node0001 30.4
v_node0002 31.1
v_node0003 29.5
v_node0004 58.7

At first we suspected the rebalancing process was still running, but the following query returns zero rows:

select * from system_sessions where session_type = 'REBALANCE_CLUSTER' and is_active = true;

Checking the logs on Node 4 does show a 'rebalance_cluster(background)' transaction that keeps starting and rolling back immediately every 5 minutes:

2017-02-07 09:42:14.010 RebalanceCluster:0x7f106c014ee0-d000000010d377 [Txn] <INFO> Begin Txn: d000000010d377 'rebalance_cluster(background)'
2017-02-07 09:42:14.010 RebalanceCluster:0x7f106c014ee0-d000000010d377 [Txn] <INFO> Rollback Txn: d000000010d377 'rebalance_cluster(background)'
2017-02-07 09:42:14.017 RebalanceCluster:0x7f106c014ee0 [Util] <INFO> Task 'RebalanceCluster' enabled

The same is not found in the logs on the other nodes.

One of the few pages in the documentation that address high CPU usage suggests setting the swappiness parameter to 0: https://my.vertica.com/docs/7.2.x/HTML/Content/Authoring/AdministratorsGuide/Monitoring/Vertica/MonitoringLinuxResourceUsage.htm
Changing this parameter did not have any impact.

I hope someone can point me in the right direction for finding the cause of the high CPU usage on the new node.

Comments

  • Did you find the root cause for this?

    If not, here are a few suggestions:

    • Verify in Vertica's HOST_RESOURCES table that the configuration of all four hosts is identical
    • Verify that the operating system also shows 2x the CPU usage on the new host
    • Verify using top that it's Vertica using the CPU
    • Check the PROJECTION_STORAGE table to verify that the projection data is distributed evenly across all nodes
    • PROFILE a query that takes at least a few seconds, and review the 'execution time (us)' counter's values in EXECUTION_ENGINE_PROFILES to see if it shows that the query plan operators are running 2x as long on the new node. Compare by operator_name and path_id.

      --Sharon
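
The system-table checks in the list above could be sketched roughly as follows. This is only an illustration, not Sharon's exact queries: the column names are the ones I would expect in these Vertica system tables, so verify them against your version, and the transaction_id and statement_id values are placeholders to be replaced with the ones reported when you PROFILE your own query.

```sql
-- Compare the hardware configuration Vertica sees on each host.
SELECT host_name, processor_count, processor_core_count, total_memory_bytes
FROM host_resources
ORDER BY host_name;

-- Check whether projection data is spread evenly across the nodes.
SELECT node_name,
       SUM(used_bytes) AS total_used_bytes,
       SUM(row_count)  AS total_rows,
       SUM(ros_count)  AS total_ros_containers
FROM projection_storage
GROUP BY node_name
ORDER BY node_name;

-- After running PROFILE on a query, compare operator run time per node.
-- The two ID values below are placeholders, not real identifiers.
SELECT path_id, operator_name, node_name,
       SUM(counter_value) AS execution_time_us
FROM execution_engine_profiles
WHERE counter_name = 'execution time (us)'
  AND transaction_id = 45035996273800000  -- placeholder
  AND statement_id = 1                    -- placeholder
GROUP BY path_id, operator_name, node_name
ORDER BY path_id, operator_name, node_name;
```

If the new node holds noticeably more bytes or ROS containers than the others, or the same operator and path_id consistently shows about twice the execution time there, that narrows down where the extra CPU is going.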

  • In addition to Sharon's recommendations, I would check whether the Tuple Mover is busy on that node. After the rebalance, that node may hold files received from the other nodes, and the Tuple Mover needs to merge them. You can verify whether the Tuple Mover is running by querying the tuple_mover_operations table, which has an is_executing column.
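
A minimal sketch of that Tuple Mover check, assuming the usual columns of tuple_mover_operations (only is_executing is mentioned in the comment above; the other column names are my assumption and worth verifying on your version):

```sql
-- List Tuple Mover operations that are currently executing, per node.
SELECT node_name,
       operation_name,
       operation_status,
       projection_name,
       operation_start_timestamp
FROM tuple_mover_operations
WHERE is_executing
ORDER BY node_name, operation_start_timestamp;
```

A steady stream of mergeout operations on v_node0004 and none on the other nodes would point to post-rebalance cleanup as a likely source of the extra CPU.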
