Query to find slowest node

Sergey_Cherepan_1 · June 2023

Hi!

In Vertica presentations back from 2020

https://www.vertica.com/wp-content/uploads/2020/04/Optimizing-Query-Performance-and-Resource-Pool-Tuning.pdf

query
SELECT 1
had been suggested as a way to detect a slow node in cluster (slide 28).

Time flies, Vertica going ahead and more efficient, and SELECT 1 now being executed on single initiator node and no longer serve suggested purpose of detecting slow node.

Can you suggest replacement - how I can "relatively easily" detect slowest node in large cluster?

Thank you

Sergey

s_crossman · June 2023

Sergey,

You may be able to narrow in on what the issue is on that bad host using the Vertica perf utilities. vcpuperf, vnetperf, and vioperf. Run each on a good host and then the bad host. Compare outputs to see if the bad host shows any degredation of cpu, network, or disk i/o. Details on these utities can be found in the Validation Scripts section of the docs. They aren't easy to read individually, but doing comparisons for good and bad hosts might show where the issue lies.
https://docs.vertica.com/12.0.x/en/setup/set-up-on-premises/install-using-command-line/validation-scripts/

Other areas that might be different would be things like cpu frequency scaling, disk readahead, I/O scheduling, and other pre install items. Those are covered in https://docs.vertica.com/12.0.x/en/setup/set-up-on-premises/before-you-install/manually-configured-os-settings/

I hope it helps,

Vertica_Curtis · June 2023

I'm wondering under what kind of conditions one node would be slower than other nodes? All nodes should configured the same, and physically be the same, so what are possible reasons why one node could be slower than other nodes?

In theory, any select from a replicated (unsegmented) table, executed on each individual node, should give you an easy way to compare performance numbers between nodes. That would be the most consistent way of testing node performance, in my mind.

Sergey_Cherepan_1 · June 2023

Yes, in perfect world all nodes should be born equal...
There are plenty of reasons why node can get slow.
Just blindly running same script from presentation on my TST cluster show something is going on with node 6:

./node_perf tst
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.799 ms. All rows formatted: 3.808 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.709 ms. All rows formatted: 3.717 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.738 ms. All rows formatted: 3.746 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.969 ms. All rows formatted: 3.978 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.619 ms. All rows formatted: 3.626 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 27.104 ms. All rows formatted: 27.115 ms

Seems to be, running SELECT 1 on each node in cluster is the easiest way for quick and dirty check of cluster sanity.
It does look like recommendations from presentations stands.

Thanks for answer!

Sergey_Cherepan_1 · June 2023

Thanks, that is a correct way of action if presence of problematic host detected.

mosheg · June 2023

In addition, to find the slower node for a specific query
Create a file, starting with your canary query, like in the following example:

\o /dev/null 
your_canary_query_here_is_the_first_query_in_the_file
\o

select node_name,execution_step,max(completion_time-time) as duration
from dc_query_executions
where transaction_id=current_trans_id() and statement_id = 1
group by node_name,execution_step
order by execution_step, duration desc;

Sergey_Cherepan_1 · June 2023

That is a great idea!

Vertica already have a bunch of stats on all queries in dc_query_executions, we can go against all queries:

with /*+ ENABLE_WITH_CLAUSE_MATERIALIZATION*/ 
 node_steps as (
select
  execution_step,
  node_name,
  avg(completion_time - time) as avg_time,
  count(*) cnt
from
  dc_query_executions
group by
  execution_step, node_name ), 
 slowest as (
select ns.execution_step, ns.avg_time slowest_time, ns.node_name slowest_node
from node_steps ns
limit 1 over(partition by ns.execution_step order by ns.avg_time desc) ),
 avgs as (
select execution_step, avg(avg_time) avg_time, sum(cnt) cnt
 from node_steps 
 group by execution_step 
)
select sl.execution_step, ag.cnt, ag.avg_time, sl.slowest_time, round(sl.slowest_time*100/ag.avg_time) percent, substr(sl.slowest_node, 7) slowest_node
from slowest sl
join avgs ag
on sl.execution_step = ag.execution_step
order by ag.cnt desc;

execution_step                 cnt     avg_time           slowest_time       percent  slowest_node  
-----------------------------  ------  -----------------  -----------------  -------  ------------  
AbandonPlan                    4 461   0 00:00:00.003505  0 00:00:00.004937  141      node0005      
BeforePlan                     7 669   0 00:00:00.002294  0 00:00:00.011712  511      node0006      
BeforePlan:AnalyzeRewrite      7 693   0 00:00:00.000235  0 00:00:00.000552  235      node0003      
BeforePlan:Parser              7 668   0 00:00:00.000022  0 00:00:00.000025  114      node0006      
CompilePlan                    5 325   0 00:00:00.008237  0 00:00:00.011510  140      node0001      
CompilePlan:EEpreexecute       31 534  0 00:00:00.003321  0 00:00:00.004635  140      node0006      
CompilePlan:ReserveResources   31 539  0 00:00:00.000040  0 00:00:00.000041  103      node0005      
ExecutePlan                    5 328   0 00:00:01.331176  0 00:00:02.784303  209      node0005      
ExecutePlan:EEexecute          30 671  0 00:00:00.245275  0 00:00:00.248021  101      node0006      
ExecutePlan:Finalize           31 538  0 00:00:00.001045  0 00:00:00.003353  321      node0006      
InitPlan                       4 460   0 00:00:00.014395  0 00:00:00.032827  228      node0005      
LoadVirtualTableRows           81      0 00:00:00.148014  0 00:00:00.809092  547      node0002      
Plan                           4 522   0 00:00:00.013784  0 00:00:00.031185  226      node0006      
PreparePlan                    5 326   0 00:00:00.028760  0 00:00:00.059176  206      node0006      
PreparePlan:DeserializePlan    26 210  0 00:00:00.006267  0 00:00:00.010280  164      node0003      
PreparePlan:DistPlanner        30 672  0 00:00:00.000882  0 00:00:00.001190  135      node0006      
PreparePlan:EEcompile          31 539  0 00:00:00.005058  0 00:00:00.007629  151      node0006      
PreparePlan:LocalPlan          31 539  0 00:00:00.002685  0 00:00:00.003362  125      node0006      
PreparePlan:TakeTableLocks     31 536  0 00:00:00.000031  0 00:00:00.000043  139      node0004      
PreparePlan:TunePlanResources  31 539  0 00:00:00.001176  0 00:00:00.001483  126      node0006      
SerializePlan                  4 460   0 00:00:00.005467  0 00:00:00.012037  220      node0006

It does require some result analysing but you can figure out that node 6 definitely should be investigated.

We're Moving!

Create My New Community Account Now

Query to find slowest node

Best Answer

Answers

Leave a Comment