Query to find slowest node

Hi!

In Vertica presentations back from 2020

https://www.vertica.com/wp-content/uploads/2020/04/Optimizing-Query-Performance-and-Resource-Pool-Tuning.pdf

query
SELECT 1
had been suggested as a way to detect a slow node in cluster (slide 28).

Time flies, Vertica going ahead and more efficient, and SELECT 1 now being executed on single initiator node and no longer serve suggested purpose of detecting slow node.

Can you suggest replacement - how I can "relatively easily" detect slowest node in large cluster?

Thank you

Sergey

Best Answer

Answers

  • I'm wondering under what kind of conditions one node would be slower than other nodes? All nodes should configured the same, and physically be the same, so what are possible reasons why one node could be slower than other nodes?

    In theory, any select from a replicated (unsegmented) table, executed on each individual node, should give you an easy way to compare performance numbers between nodes. That would be the most consistent way of testing node performance, in my mind.

  • Yes, in perfect world all nodes should be born equal...
    There are plenty of reasons why node can get slow.
    Just blindly running same script from presentation on my TST cluster show something is going on with node 6:

    ./node_perf tst
    IP address masked
    Timing is on.
    1
    Time: First fetch (1 row): 3.799 ms. All rows formatted: 3.808 ms
    IP address masked
    Timing is on.
    1
    Time: First fetch (1 row): 3.709 ms. All rows formatted: 3.717 ms
    IP address masked
    Timing is on.
    1
    Time: First fetch (1 row): 3.738 ms. All rows formatted: 3.746 ms
    IP address masked
    Timing is on.
    1
    Time: First fetch (1 row): 3.969 ms. All rows formatted: 3.978 ms
    IP address masked
    Timing is on.
    1
    Time: First fetch (1 row): 3.619 ms. All rows formatted: 3.626 ms
    IP address masked
    Timing is on.
    1
    Time: First fetch (1 row): 27.104 ms. All rows formatted: 27.115 ms

    Seems to be, running SELECT 1 on each node in cluster is the easiest way for quick and dirty check of cluster sanity.
    It does look like recommendations from presentations stands.

    Thanks for answer!

  • Thanks, that is a correct way of action if presence of problematic host detected.

  • moshegmosheg Vertica Employee Administrator

    In addition, to find the slower node for a specific query
    Create a file, starting with your canary query, like in the following example:

    \o /dev/null 
    your_canary_query_here_is_the_first_query_in_the_file
    \o
    
    select node_name,execution_step,max(completion_time-time) as duration
    from dc_query_executions
    where transaction_id=current_trans_id() and statement_id = 1
    group by node_name,execution_step
    order by execution_step, duration desc;
    
  • edited June 2023

    That is a great idea!

    Vertica already have a bunch of stats on all queries in dc_query_executions, we can go against all queries:

    with /*+ ENABLE_WITH_CLAUSE_MATERIALIZATION*/ 
     node_steps as (
    select
      execution_step,
      node_name,
      avg(completion_time - time) as avg_time,
      count(*) cnt
    from
      dc_query_executions
    group by
      execution_step, node_name ), 
     slowest as (
    select ns.execution_step, ns.avg_time slowest_time, ns.node_name slowest_node
    from node_steps ns
    limit 1 over(partition by ns.execution_step order by ns.avg_time desc) ),
     avgs as (
    select execution_step, avg(avg_time) avg_time, sum(cnt) cnt
     from node_steps 
     group by execution_step 
    )
    select sl.execution_step, ag.cnt, ag.avg_time, sl.slowest_time, round(sl.slowest_time*100/ag.avg_time) percent, substr(sl.slowest_node, 7) slowest_node
    from slowest sl
    join avgs ag
    on sl.execution_step = ag.execution_step
    order by ag.cnt desc;
    
    execution_step                 cnt     avg_time           slowest_time       percent  slowest_node  
    -----------------------------  ------  -----------------  -----------------  -------  ------------  
    AbandonPlan                    4 461   0 00:00:00.003505  0 00:00:00.004937  141      node0005      
    BeforePlan                     7 669   0 00:00:00.002294  0 00:00:00.011712  511      node0006      
    BeforePlan:AnalyzeRewrite      7 693   0 00:00:00.000235  0 00:00:00.000552  235      node0003      
    BeforePlan:Parser              7 668   0 00:00:00.000022  0 00:00:00.000025  114      node0006      
    CompilePlan                    5 325   0 00:00:00.008237  0 00:00:00.011510  140      node0001      
    CompilePlan:EEpreexecute       31 534  0 00:00:00.003321  0 00:00:00.004635  140      node0006      
    CompilePlan:ReserveResources   31 539  0 00:00:00.000040  0 00:00:00.000041  103      node0005      
    ExecutePlan                    5 328   0 00:00:01.331176  0 00:00:02.784303  209      node0005      
    ExecutePlan:EEexecute          30 671  0 00:00:00.245275  0 00:00:00.248021  101      node0006      
    ExecutePlan:Finalize           31 538  0 00:00:00.001045  0 00:00:00.003353  321      node0006      
    InitPlan                       4 460   0 00:00:00.014395  0 00:00:00.032827  228      node0005      
    LoadVirtualTableRows           81      0 00:00:00.148014  0 00:00:00.809092  547      node0002      
    Plan                           4 522   0 00:00:00.013784  0 00:00:00.031185  226      node0006      
    PreparePlan                    5 326   0 00:00:00.028760  0 00:00:00.059176  206      node0006      
    PreparePlan:DeserializePlan    26 210  0 00:00:00.006267  0 00:00:00.010280  164      node0003      
    PreparePlan:DistPlanner        30 672  0 00:00:00.000882  0 00:00:00.001190  135      node0006      
    PreparePlan:EEcompile          31 539  0 00:00:00.005058  0 00:00:00.007629  151      node0006      
    PreparePlan:LocalPlan          31 539  0 00:00:00.002685  0 00:00:00.003362  125      node0006      
    PreparePlan:TakeTableLocks     31 536  0 00:00:00.000031  0 00:00:00.000043  139      node0004      
    PreparePlan:TunePlanResources  31 539  0 00:00:00.001176  0 00:00:00.001483  126      node0006      
    SerializePlan                  4 460   0 00:00:00.005467  0 00:00:00.012037  220      node0006      
    
    

    It does require some result analysing but you can figure out that node 6 definitely should be investigated.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file