CPU usage per query

How can I tell how much CPU load is being generated by the queries that are currently executing?

I'm currently seeing rather high CPU load across the cluster, but none of the currently executing queries look much different than our ordinary load, which historically has had much lower CPU load.

Using "top" on each of the nodes, all the CPU load is from the vertica process, so it's not something else on the OS causing this.

Regardless of my current problem, I would really like to know how to monitory CPU usage by query. All I can find is ways to monitor overall CPU usage.

Comments

  • One way to do this is by using the Management Console.
    Or you can see the output of your system_resource_usage
    -this will give you an output of your cpu % per node on each minute.
    dbadmin=> select node_name,average_cpu_usage_percent,end_time from system_resource_usage;    node_name     | average_cpu_usage_percent |      end_time
    ------------------+---------------------------+---------------------
     v_vmart_node0001 |                     31.45 | 2014-06-11 15:16:00
     v_vmart_node0001 |                       1.4 | 2014-06-11 15:15:00
     v_vmart_node0001 |                       1.1 | 2014-06-11 15:14:00
     v_vmart_node0001 |                      1.38 | 2014-06-11 15:13:00
     v_vmart_node0001 |                      1.25 | 2014-06-11 15:12:00
    -this cpu output is responsible for the entire Vertica instance consumption.

    You could also use the client_pid from the sessions table, and with it you can output the cpu usage of that linux pid :
    top -p "client_pid"
    One other way to see your cpu usage is by running the explain command with the verbose option:
    dbadmin=> explain verbose select sum(sales_dollar_amount)/count(*) from store.store_sales_fact ;
    explain verbose select sum(sales_dollar_amount)/count(*) from store.store_sales_fact ;
     Access Path:
     LDISTRIB_UNSEGMENTED
     +-GROUPBY NOTHING [Cost: 5375.000000, Rows: 1.000000 Disk(B): 0.000000 CPU(B): 40000000.000000 Memory(B): 48.000000 Netwrk(B): 0.000000 Parallelism
    : 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 1)
     |  Aggregates: sum(store_sales_fact.sales_dollar_amount), count(*)
     |  LDISTRIB_UNSEGMENTED
    .....
     Estimated resources for plan: -----------------------------
     Scratch Memory MB: 3
     File Handles:      4
     Worker Threads:    4
     Blocking Threads:  0
     Externalizing Ops: 0
     Unbounded Mem Ops: 0
     Max Threads:     1

  • The client_pid is the process ID on the client machine, not the server, so that would not tell me anything about the CPU being used by the vertica process executing the query.

    As I understand it, the explain plan gives estimated resource usage. I need to know what is being used by a query that is currently executing, which, due to many different factors, may be completely different than the explain plan's estimate.

    Is there any way to break down CURRENT cpu usage by query?
  • Hi!

    It's not so easy to calculate a CPU usage itself, so Im afraid its not possible.
    But you can do a some synthetic test with pidstat and mpstat while passing a PID of Vertica and not client. Take a cpu usage of Vertica for a minute in IDLE; take usage with query and subtract "IDLE usage".

    PS

    But you can try (profile has counter - 
    clock time (us)
    How is CPU usage calculated?
    http://www.design-reuse.com/articles/8289/how-to-calculate-cpu-utilization.html
    https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/PROFILE.htm...


  • This sort of question is something that Vertica's Management Console ("MC") tries to answer, in a nice graphical way.  Are you familiar with the MC?

    Also -- maybe you have background system tasks running, in addition to your regular queries?  The MC should display these; you might also consider looking at, for example, the "tuple_mover_operations" system table, to see if there's a long-running mergeout or other ongoing operation.


    For what it's worth, to directly answer the question:  The "execution_engine_profiles" system table contains various counters that attempt to speak to resource usage of running queries.

    Many of "execution_engine_profiles"'s fields are undocumented.  Because we don't document what they do, we make no guarantees about how well they do it :-)  And it's certainly not easy to read.  But you may be able to gain some insights by poking around in there.

    (The table has columns for the session, transaction, and statement ID's.  Those three columns form a composite key that can be joined with many other system tables; from there you can figure out which query a particular row belongs to, etc.)
  • Following up on Daniel's comment - the "clock time (us)" counter is more of an elapsed time counter.  To find the cpu load from specific query paths/operators, use the "execution time (us)" counter.   Keep in mind that each instance of the operator in the profiling data represents a thread.  If you have 16 cores, and see 16 threads each with 100 seconds (100,000,000 us) of execution time, then those threads are fully using the cpus for those 100 seconds.  

      --Sharon

  • Hi Sharon,

    Want to dig deeper into it.
    Please correct me if I wrong anywhere.

    Suppose I have 3 node cluster.
    # nproc
    48
    Each node is 48 cores (cpu's)

    so when I run a query and they try to check the CPU usage of it. using this query
    SELECT node_name,, operator_name, counter_valueFROM v_monitor.execution_engine_profiles
    WHERE counter_name='execution time (us)' and session_id='xxxxx' and transaction_id=xxxxx and statement_id=xx 
    ORDER BY node_name, counter_value ,statement_id DESC;
    I get a resultset like this:
         node_name     |  operator_name | counter_value-------------------+--------------+---------------+---------------
     node1 | ExprEval      |            45
     node1 | TopK          |            72
     node1 | NetworkSend   |           119
     node1 | NewEENode     |           204
     node1 | NetworkSend   |           418
     node1 | Root          |           432
     node1 | NetworkSend   |          1374
     node1 | NetworkRecv   |          1788
     node1 | GroupByHash   |          5357
     node1 | NetworkRecv   |          5538
     node1 | GroupByPipe   |          8394
     node1 | Scan          |          9019
     node1 | StorageUnion  |         11239
     node1 | NetworkRecv   |         25086
     node1 | Scan          |         26708
     node1 | Scan          |         51723
     node1 | NetworkSend   |         60579
     node1 | GroupByHash   |         61320
     node1 | TopK          |         81825
     node1 | ExprEval      |        192245
     node1 | Join          |        198994
     node1 | StorageUnion  |        233082
     node1 | StorageUnion  |        333623
     node1 | NetworkRecv   |       1144999
     node1 | ExprEval      |       1942726
     node1 | Join          |       4368231
    The resultset continues for next two nodes also with same operator name 
    Going ahead, as you said, each instance of operator represent a thread, so I tried this query for node 1
    select operator_name,count(operator_name) from (SELECT node_name, operator_name, counter_value
    FROM v_monitor.execution_engine_profiles
    WHERE counter_name='execution time (us)' and session_id='xxxx' and transaction_id=xxxxx and statement_id=xx and node_name='node1' 
    ORDER BY node_name, counter_value ,statement_id DESC) sub group by operator_name;  operator_name | count  ---------------+-------   NetworkSend   |     4   ExprEval      |     3   GroupByPipe   |     1   Root          |     1   StorageUnion  |     3   GroupByHash   |     2   Join          |     2   NewEENode     |     1   TopK          |     2   Scan          |     3   NetworkRecv   |     4  (11 rows)  
     
    Lets consider operator scan and its value is 3, so does that mean only three CPU's from node1 participated in scanning process and for CPU usgae to 100 % it should include all CPU.i.e. 48 and the value for scan should be 48 ?

    Also, How can we calculate the CPU usage for the complete query on node 1 from this information.

    Thanks in Advance.

  • Hi Navin,

    Yes - your analysis is accurate regarding the number of threads used per operator.

    It's hard to quantify the cpu usage for the entire query, since some portions of a query will fan out over all cores, some over a subset of cores, some will execute on only a single core.  Some execute concurrently with other operators while some need to wait on data from other operators.  The "execution time" counter can provide insight into where the cpu time is being spent, but I don't tend to try to draw conclusions about total cpu usage from that data.

      --Sharon

  • Hi Navin!


    - Im afraid it possible to calculate accurate CPU usage on OpenVMS only(and not even on all hw).
    - You can find over Internet a lot of threads that "old" utils like top do not calculates CPU usage properly. What talk about sub-task of Vertica.
    - Since we dont know a couple things (I belive Vertica also manages a type of task priority heap - at least resource pool) like execution "quantum" time, priority queue  and so on its going to be impossible.
    - Over Internet you can find some example for MS SQL, but be careful it can be useful only as indicator against a differ query that returns same result only. But a Vertica cost its a same indicator and it accurate enough.

  • This gap in quantifying resource usage by query on a given cluster makes it difficult to detect and eliminate poorly written queries on a production cluster at capacity. It would be ideal if execution_engine_profiles or an alternative instrument would provide some CPU usage metrics in CPU seconds. 
    The inherent parallelism on a node can be accounted for by cumulating CPU usage on all CPUs ( cores) where the query is running and has run in the past. 
  • having same problem!!
    one query stuck and takes longer than usual and CPU usage is around %98 percent while memory usage is very low.
    same query runs normal every day. the problem i see is: its using a table  logins which is arount 300mil table
    running count or select limit 10 on this table takes too. even close_session  of query is not working;
    i believe the table corrupted somehow or something is get realy messed after last load to this table.

    had same issue before and truncating and reloading the table solved issue before.
    feel like similar issue

  • Hi ,

    Not clear what you call table logins , please elaborate .

  • @Ismail
    truncating and reloading the table solved issue
    What about IO? May be it's "mergeout" took all resources?

    @Eli
    Ismail means that query system tables like sessions(logins) or meta functions like "close_sessions()" can take a long time.


    Odd bug. I also familiar with this issue, sometimes projection refreshes can cause to this issue. But who cares from bugs (except Vertica clients)? I know about an open bug for more than 2 years and no one care.
  • Sorry for not being clear enough,
    login table is a table which we save customer logins. so its one of the biggest table with ~300mil record.





  • Hi Guys,
    im not 100 percent sure if this is the correct solution but i want to share how we solved our issue.

    in our scripts we are doing alot of delete inserts and we see that it is creating a lot of delete vectors.
    we started to run following  code daily at the end of our loads. and this code helped us a lot to get good performance and since then we haven't got into same problem.

    select make_ahm_now();
    select purge();

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file