High CPU usage

We observe high CPU usage on all nodes of the cluster. It has occurred two days in a row and lasts for several hours each day (3 hours on the first day and 4 hours on the second).
We had to turn off our ETL while CPU usage was growing.

It looks like this: [image: CPU usage graph]

We are trying to find out the reasons for this behavior.
Our observations and checks:
1. We observe a large increase in waits on the global catalog. GCL throughput decreases significantly.
2. RAM, thread, file handle and network consumption do not differ from a usual day.
3. TM does not have any queue. It operates as usual. We can't see any unusual TM operations.
4. We checked all queries in sessions and system_sessions. Everything is as usual, nothing suspicious. The system services table shows the usual run rate.
5. We don't have any rebalance or partition reorganize operations while the problem occurs.
6. We checked all data_collector tables (about 200 tables), without success.
7. We can observe a slight correlation with disk latency: when CPU usage briefly dips, disk latency increases significantly.
8. Used disk space grows slightly during the high CPU usage.
9. WOS usage is as usual.

Can anybody suggest ways to determine the cause of the problem?
Thank you.

Comments

  • Hi, can you check the mergeout and moveout operations? You might be loading data at that time, and the move operations could give you this behavior.

  • Hi lzayda,

    May I know which OS and Vertica version you are running, and how many nodes there are in your cluster?

    This is important to know before making suggestions.

    Regards
  • Hi!

    We are using Debian 8.6. The cluster consists of 14 nodes and 1 standby node. The Vertica version is 8.0.1-3. There are no other apps on the servers. Server configuration: 56 cores, 256GB RAM, 6TB disks (about 10-15% free space).

    CPU is consumed by a couple of Vertica threads. We tried to run lsof on these threads to find out which ROS containers are in use, but lsof shows nothing. As we understand it, all file handles are opened through the main process.
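One way to list the threads that consume the most CPU, independent of lsof, is to read the per-thread counters under /proc. The following is a minimal Python sketch, assuming a Linux /proc filesystem; parse_task_stat and busiest_threads are hypothetical helper names, not part of Vertica or any library, and the Vertica process ID has to be supplied by the caller.

```python
import os


def parse_task_stat(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into (tid, name, utime, stime)."""
    # The thread name sits in parentheses and may itself contain spaces,
    # so split around the first "(" and the last ")".
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    tid = int(line[:open_paren])
    name = line[open_paren + 1:close_paren]
    rest = line[close_paren + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15,
    # both measured in clock ticks since the thread started.
    utime, stime = int(rest[11]), int(rest[12])
    return tid, name, utime, stime


def busiest_threads(pid, top=5):
    """Return the `top` threads of `pid`, ordered by cumulative CPU ticks."""
    threads = []
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                threads.append(parse_task_stat(fh.read()))
        except OSError:
            continue  # the thread exited between listdir() and open()
    threads.sort(key=lambda t: t[2] + t[3], reverse=True)
    return threads[:top]
```

The counters are cumulative, so to see what is busy right now it is probably better to call busiest_threads twice, a few seconds apart, and compare the tick counts per thread ID.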


    ckotsidimos, how can we check which additional moveout/mergeout operations are being processed? We can't see any strange operations in the TUPLE_MOVER_OPERATIONS table. We turned off ETL and can see only a few operations from analytics and external systems.

    Memorysize and Maxmemorysize of WOS are both 2GB.
    Memorysize and Maxmemorysize of TM are 2GB and 12GB; plannedconcurrency and maxconcurrency are both 6.

    [Image: graphs of WOS used and WOS spill; the CPU problem lasted from 12:40 to 16:40]

    [Image: TM and WOS pools RAM usage]

  • Thanks for your answers and help. This problem is a good challenge for our database team.

    We also checked the dc_tuple_mover_events table with two queries:

    SELECT MAX(time)-MIN(time) as duration, SUM(total_size_in_bytes), MIN(time) as start, MAX(time) as end, transaction_id,user_name,operation,schema_name,table_name
    FROM dc_tuple_mover_events
    WHERE date_trunc('day',time)='2017-05-31' and operation!='Analyze Statistics'
    GROUP BY transaction_id,user_name,operation,schema_name,table_name
    ORDER BY 1 DESC;

    SELECT date_trunc('hour',time), COUNT(*), SUM(total_size_in_bytes)
    FROM dc_tuple_mover_events
    WHERE date_trunc('day',time)='2017-05-31' and operation!='Analyze Statistics'
    GROUP BY date_trunc('hour',time)
    ORDER BY 1 DESC;

    The longest TM operation takes about 6 minutes. None of the operations with a long duration or a large total_size_in_bytes fall within the problem interval from 12:40 to 16:40.
    The number of operations in that interval also decreased.

  • Totally out of the blue:
    1) Have any new queries been added to the process in the last 2 days?
    2) Are you loading data to the standby node?

    Can you check this also?
    https://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/TupleMover/TuningTheTupleMover.htm

    Regards,

  • 1) It's possible that there are some new queries. We first tried to find new or strange queries, but everything looks as usual. I'll additionally check all data load streams.
    2) No, this node doesn't have any load. We keep it only for a quick swap in case one of the normal nodes is badly damaged.

    We tuned the TM two weeks ago according to that manual.
    Moreover, the TM config was changed 2 years ago to:
    SELECT SET_CONFIG_PARAMETER('MoveOutInterval', 100);
    SELECT SET_CONFIG_PARAMETER('MergeOutInterval', 200);

    Are there any operations besides moveout/mergeout and COPY parsing that can consume CPU in this way?

  • edited June 2017

    CPU can be consumed by poor queries, this is why I asked for this. Did you plan to run a designer session with the query load that you have and check the outcome without deploying?
    From my experience such things happen from poorly written queries or an unexpected change in the amount of data consumed.

    I'm not an HP employee

  • Thank you.
    We didn't start a designer session during the CPU problem. If the problem occurs today, I'll run it.

    How can we find the poor query in history now, without the designer? As far as I understand, the resource_acquisitions table will not help us.

    Also, I have checked dc_load_events by the query:
    SELECT date_trunc('hour',time), COUNT(*) as start, SUM(rows_accepted)
    FROM dc_load_events
    WHERE date_trunc('day',time)='2017-05-31'
    GROUP BY date_trunc('hour',time)
    ORDER BY 1 DESC;

    Huge decrease in rows_accepted and load_events for problem period.

  • "Huge decrease in rows_accepted and load_events for problem period." This might happen if the db cannot cope with the load. I would say to run a profiler during the period that you expect the highest load. If you check the documentation you can find details on this. This way you will have all the queries for the period. Furthermore, you can always put a management console in another VM and you will get much more info from there.

    I mentioned the data load in the stand by node because this is used if you want to load data without much impact on the working nodes.

  • Hi Izayda.

    I am not well versed in Debian, but I can share some ideas based on RHEL/CentOS, which is also a Linux distribution.

    Can you capture the output of "perf top" and dstat from these nodes?

    These are Linux commands, but I am not really sure whether they will work on Debian. Since Debian is also a member of the Linux family, something like "perf top" should work.

    perf top will help to understand which functions (kernel or user space) are consuming CPU, and we can then work accordingly.

    This method has helped us a lot in fixing unexplained CPU utilization and slow performance issues.

    Additionally, please try to dump the CPU operating frequency, which can be found in /proc/cpuinfo. Check whether your CPUs are throttling down to lower frequencies due to idle or wait states.
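The frequency check can be scripted so that it is easy to repeat on every node. This is a minimal Python sketch, assuming an x86 Linux kernel that reports a "cpu MHz" line per core in /proc/cpuinfo (other architectures may not expose that field); parse_cpu_mhz and frequency_summary are hypothetical helper names.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return the current clock of every core, in MHz, from /proc/cpuinfo text."""
    mhz = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the per-core "cpu MHz" lines are of interest here.
        if sep and key.strip() == "cpu MHz":
            mhz.append(float(value))
    return mhz


def frequency_summary(cpuinfo_text):
    """Summarise core count and min/max clock, or None if no clocks were found."""
    mhz = parse_cpu_mhz(cpuinfo_text)
    if not mhz:
        return None
    return {"cores": len(mhz), "min_mhz": min(mhz), "max_mhz": max(mhz)}
```

On a node this could be used as frequency_summary(open("/proc/cpuinfo").read()), once during normal load and once during the high CPU period. A minimum far below the nominal clock while the node is busy would be a hint that the CPUs are being throttled.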

    Regards,
    Raghav Agrawal

  • ckotsidimos, RaghavA, thank you! We will collect data during the next high CPU usage period and then return to this topic.

  • skeswani (Employee)
    1. High CPU usage is not a bad thing in itself, unless it results in an SLA being missed or a workload timing out. It just means you are getting the most out of your hardware. Low CPU usage could also mean an over-provisioned cluster. If a workload is CPU bound, Vertica will push all cores to complete it faster (of course you can throttle it), but I am not sure that is desirable.

    2. Not sure if you already did this, but it's important to know which process is consuming the CPU. Which kind of CPU time is increasing (user, system, nice, iowait, etc.)? If it's user/nice CPU, which process is causing it (is it the Vertica process or some other), and which part of the CPU usage goes up?

    3. Please run perf top (perf top -z) and post a screenshot both when CPU usage is low and when it is high. Knowing the difference is relevant.

    4. Is this a virtual (VM/AWS) instance or a bare metal instance?

    5. What is the load average (uptime) during high CPU usage (on all nodes)? Load average is a better measure of the cluster being overloaded than high CPU usage.
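Points 2 and 5 above can be checked with a small script. This is a minimal Python sketch, assuming the Linux /proc/stat and /proc/loadavg formats; the helper names parse_cpu_line, cpu_shares and load_per_core are hypothetical. It splits the CPU time between two /proc/stat samples into user, system, nice, iowait and so on, and relates the 1-minute load average to the number of cores (56 per node in this thread).

```python
# Column order of the aggregate "cpu" line in /proc/stat. The guest columns
# are left out because the kernel already counts them inside user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU tick counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before_text, after_text):
    """Percentage of CPU time spent in each state between two samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    delta = {name: after[name] - before[name] for name in after}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * ticks / total for name, ticks in delta.items()}


def load_per_core(loadavg_text, cores):
    """1-minute load average from /proc/loadavg text, divided by core count."""
    return float(loadavg_text.split()[0]) / cores
```

Taking two /proc/stat samples a few seconds apart during the problem window and comparing the shares with a quiet period should show whether the growth is in user, system or iowait time. A load per core well above 1 would suggest the node has more runnable work than cores.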
