Query to find slowest node
Hi!
In a Vertica presentation back from 2020, the query
SELECT 1
was suggested as a way to detect a slow node in a cluster (slide 28).
Time flies, Vertica keeps getting more efficient, and SELECT 1 is now executed on the initiator node only, so it no longer serves the suggested purpose of detecting a slow node.
Can you suggest a replacement - how can I "relatively easily" detect the slowest node in a large cluster?
Thank you
Sergey
Best Answer
s_crossman Vertica Employee
Sergey,
You may be able to narrow in on what the issue is on that bad host using the Vertica perf utilities: vcpuperf, vnetperf, and vioperf. Run each on a good host and then on the bad host, and compare the outputs to see whether the bad host shows any degradation of CPU, network, or disk I/O. Details on these utilities can be found in the Validation Scripts section of the docs. The outputs aren't easy to read individually, but doing comparisons between good and bad hosts might show where the issue lies.
https://docs.vertica.com/12.0.x/en/setup/set-up-on-premises/install-using-command-line/validation-scripts/
Other areas that might differ are things like CPU frequency scaling, disk readahead, I/O scheduling, and other pre-install items. Those are covered in https://docs.vertica.com/12.0.x/en/setup/set-up-on-premises/before-you-install/manually-configured-os-settings/
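On a larger cluster it may help to script the good-host/bad-host comparison. A minimal sketch, assuming passwordless ssh and the utilities installed under /opt/vertica/bin (both assumptions - adjust to your layout):

```shell
#!/bin/sh
# Sketch: run vcpuperf and vioperf on a known-good host and a suspect host,
# saving the outputs so they can be diffed side by side.
# vnetperf is omitted here: it measures the network between hosts, so it is
# run once for the whole cluster rather than per host.
perf_compare() {
    good=$1; bad=$2
    for tool in vcpuperf vioperf; do
        for host in "$good" "$bad"; do
            # SSH can be overridden, e.g. for a dry run
            ${SSH:-ssh} "$host" "/opt/vertica/bin/$tool" > "/tmp/$tool.$host.out" 2>&1
        done
        echo "compare: diff /tmp/$tool.$good.out /tmp/$tool.$bad.out"
    done
}
# Usage: perf_compare node01 node06
```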
I hope it helps,
Answers
I'm wondering under what kind of conditions one node would be slower than the others. All nodes should be configured the same, and physically be the same, so what are the possible reasons why one node could be slower than the others?
In theory, any select from a replicated (unsegmented) table, executed on each individual node, should give you an easy way to compare performance numbers between nodes. That would be the most consistent way of testing node performance, in my mind.
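That per-node check is easy to script. A minimal sketch, assuming vsql is in PATH and that SELECT 1 with \timing is an acceptable canary - substitute a select from your replicated table if you prefer:

```shell
#!/bin/sh
# Sketch: run the same canary query against every node and print its timing.
# Connecting with -h makes each node the initiator for its own run.
# Node addresses are arguments; VSQL can be overridden, e.g. for a dry run.
run_canary() {
    for node in "$@"; do
        echo "== $node =="
        printf '\\timing\nSELECT 1;\n' | ${VSQL:-vsql} -h "$node"
    done
}
# Usage: run_canary 10.0.0.1 10.0.0.2 10.0.0.3
```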
Yes, in a perfect world all nodes would be born equal...
There are plenty of reasons why a node can get slow.
Just blindly running the same script from the presentation on my TST cluster shows something is going on with node 6:
./node_perf tst
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.799 ms. All rows formatted: 3.808 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.709 ms. All rows formatted: 3.717 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.738 ms. All rows formatted: 3.746 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.969 ms. All rows formatted: 3.978 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 3.619 ms. All rows formatted: 3.626 ms
IP address masked
Timing is on.
1
Time: First fetch (1 row): 27.104 ms. All rows formatted: 27.115 ms
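Eyeballing six nodes is easy, but on a large cluster a small parser helps. A sketch, assuming vsql \timing output in the format shown above, with one "Time: ..." line per node printed in node order:

```shell
#!/bin/sh
# Sketch: pick the slowest node out of per-node vsql \timing output.
# Assumes one "Time: ..." line per node, in node order; uses the trailing
# "All rows formatted: X ms" value for the comparison.
slowest_node() {
    awk '/^Time:/ {
        n++                   # node counter, in output order
        t = $(NF - 1) + 0     # "All rows formatted" time in ms
        if (t > best) { best = t; slow = n }
    }
    END { print "slowest: node " slow " (" best " ms)" }'
}
# Usage: ./node_perf tst | slowest_node
```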
It seems that running SELECT 1 against each node in the cluster is still the easiest quick-and-dirty check of cluster sanity.
So the recommendation from the presentation stands.
Thanks for the answer!
Thanks, that is the correct course of action once a problematic host has been detected.
In addition, to find the slower node for a specific query:
Create a file, starting with your canary query, like in the following example:
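A minimal sketch of such a canary file, assuming \timing plus SELECT 1 as the canary - substitute the specific query you want to profile:

```sql
\timing
SELECT 1;
```

Then run it against each node with vsql -h <node> -f canary.sql and compare the reported times.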
That is a great idea!
Vertica already has a bunch of stats on all queries in dc_query_executions, so we can go across all queries:
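A sketch of such a query, assuming dc_query_executions exposes node_name, time, and completion_time per execution step - verify the column list on your Vertica version before relying on it:

```sql
-- Per-node average step duration across all recorded queries.
-- Column names are assumptions; check with:
--   SELECT * FROM dc_query_executions LIMIT 1;
SELECT node_name,
       COUNT(*)                    AS steps,
       AVG(completion_time - time) AS avg_step_duration
FROM dc_query_executions
GROUP BY node_name
ORDER BY avg_step_duration DESC;
```

A consistently higher average on one node is a hint, not proof - follow up with the perf utilities on that host.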
It does require some analysis of the results, but you can figure out that node 6 definitely should be investigated.