Performance issue - collecting ideas

dimitri_p
edited June 23 in General Discussion

So I have two 3-node Vertica clusters (v25), where one cluster is about 1.5 times more powerful than the other: about 1.5 times more CPU cores, about 1.5 times more memory, about 1.5 times faster network, and about 1.5 times faster storage.

Both clusters have the same database image deployed: same resource pools, same tables, same projections, same data. Same OS version too.

I have a query that I run on both clusters, and the weaker cluster executes it 3 times faster.

I have checked EXPLAIN and it matches to a T (projections, costs, row counts, paths). All statistics are up to date. Resource pools are configured identically, although EXECUTIONPARALLELISM is set to AUTO and MAXMEMORYSIZE to 40%, so the absolute numbers differ.

I have checked the execution_engine_profiles counters, and the more powerful system looks uniformly slower in every major counter; nothing is particularly slower, everything is just slower across the board.

Resource consumption shows that the more powerful system spends many more CPU cycles on the query, but the disk spillage is about the same.
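
(For reference, the comparison can be done with something along these lines, using the standard query_consumption view; pick out the statement in question from the most recent rows:)

select transaction_id, statement_id, cpu_cycles_us, peak_memory_kb,
       data_read_bytes, network_bytes_sent, network_bytes_received
from v_monitor.query_consumption
order by end_time desc
limit 20;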

Query events are the same (GROUP BY does not fit in memory).

There is no data skew (the projections used are segmented).
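
(For reference, skew can be verified with something like this; the projection name pattern is just a placeholder:)

select node_name, projection_name, sum(total_row_count) as total_rows
from v_monitor.storage_containers
where projection_name ilike 'my_fact_table_b%'  -- placeholder pattern
group by 1, 2
order by 2, 1;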

What else do I check? Give me anything that comes to mind!

Answers

  • mosheg Vertica Employee Administrator

    The short answer is to open a support case, run scrutinize on both clusters, and compare the score each cluster gets in the VAdvisor report.
    VAdvisor scores include deeper diagnostics that are not always visible through manual inspection.
    Also, keep in mind that there are a lot of things that can differ across systems, even if on paper they look identical. Hidden OS-level configurations, firmware differences, disk scheduling policies, CPU throttling, and even BIOS power-saving modes can significantly impact performance. So don't rule out lower-level or seemingly irrelevant discrepancies.

    Before changing any parameter, always validate the impact in a QA environment first.
    Some changes may degrade performance or affect stability depending on workload patterns.

    The longer answer breaks the issue down into multiple layers:
    1. Reconfirm actual CPU models, clock speeds, and NUMA configuration, since hardware with more cores isn't always faster for every workload.
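    For example, something along these lines on each cluster (standard v_monitor.host_resources; NUMA layout and CPU throttling still need OS-level tools):
    select host_name, processor_count, processor_core_count, processor_description,
           (total_memory_bytes / (1024*1024*1024))::int as total_memory_gb
    from v_monitor.host_resources
    order by host_name;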
    2. Software: OS version and OS settings as recommended by Vertica.
    3. Compare vcpuperf, vioperf, and vnetperf results.
    4. Are the explain plans really identical?
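    If you PROFILE the query on each cluster, the per-path timings can also be put side by side; a sketch (the transaction_id and statement_id below are placeholders, use the values PROFILE reports):
    select path_id, path_line_index, running_time, memory_allocated_bytes,
           read_from_disk_bytes, path_line
    from v_monitor.query_plan_profiles
    where transaction_id = 45035996273705915  -- placeholder
      and statement_id = 1                    -- placeholder
    order by path_id, path_line_index;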
    5. Extract the data and compare operators side by side in Excel:

    select node_name, path_id, operator_name, counter_name, counter_tag, max(counter_value)
    from execution_engine_profiles
    where counter_name ilike '%(us)%'
    group by 1,2,3,4,5
    order by 6 desc;

    select node_name, path_id, operator_name, counter_name, counter_tag, counter_value
    from execution_engine_profiles
    where counter_name ilike '%(us)%'
    order by node_name, path_id, operator_name, counter_name, counter_tag, counter_value asc;

    6. Check your LOCALE on both systems, because sorting can be the reason for performance degradation:
      Vertica sorts data, and sorting depends on LOCALE.
      LOCALE is a parameter that defines the user's language, country, and any special variant preferences, such as collation.
      Vertica uses the locale to determine the behavior of various string functions as well as for ordering and comparison,
      for example in GROUP BY, ORDER BY, and joins.
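      A quick way to compare (SHOW LOCALE reports the current session, so run it with the same user and connection settings on both clusters; the second query just looks for any locale-related parameter changed from default):
      SHOW LOCALE;
      select parameter_name, current_value, default_value
      from configuration_parameters
      where parameter_name ilike '%locale%';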

    7. To rule out environmental reasons for slowness, compare the results of the following check on both clusters.
      Run as root: /opt/vertica/oss/python3/bin/python3 -m vertica.local_verify

    8. Check and compare your Vertica catalog size on both clusters:
      select node_name, max(catalog_size_in_MB)::int as end_catalog_size_mem_mb
      from (select node_name,
                   trunc(("time")::timestamp, 'SS'::varchar(2)) as ts,
                   sum(total_memory_max_value - free_memory_min_value) / (1024 * 1024) as catalog_size_in_MB
            from dc_allocation_pool_statistics_by_second
            group by 1, 2) foo
      group by 1
      order by 1;

    9. Check and compare the number of ROS containers on both systems.
      For example, use the following query to find the top 50 projections by ROS container count per node:
      SELECT s.node_name, p.table_schema, s.projection_name,
      COUNT(DISTINCT s.storage_oid) storage_container_count,
      COUNT(DISTINCT partition_key) partition_count,
      COUNT(r.rosid) ros_file_count
      FROM storage_containers s LEFT OUTER JOIN PARTITIONS p
      ON s.storage_oid = p.ros_id JOIN vs_ros r
      ON r.delid = s.storage_oid
      GROUP BY 1,2,3 ORDER BY 4 DESC LIMIT 50;

    10. Check and compare parameters changed from their defaults on both:
      select node_name,parameter_name,current_value,default_value,description
      from vs_configuration_parameters
      where current_value != default_value;

    Also check for parameters that differ between nodes:
    select aa.node_name, aa.parameter_name, aa.current_value,
    bb.node_name, bb.current_value
    from vs_configuration_parameters aa, vs_configuration_parameters bb
    where aa.current_value != bb.current_value
    and aa.node_name != bb.node_name
    and aa.parameter_name = bb.parameter_name;

    11. Compare, on both clusters, requests for resources that were rejected by the Resource Manager.
      SELECT * FROM resource_rejections;
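      Since the pools are defined with a percentage MAXMEMORYSIZE and EXECUTIONPARALLELISM AUTO, the resolved per-node values will differ between the clusters; a sketch to put the actual budgets side by side (standard v_monitor.resource_pool_status):
      select node_name, pool_name, max_memory_size_kb, query_budget_kb,
             planned_concurrency, max_concurrency, execution_parallelism
      from v_monitor.resource_pool_status
      order by pool_name, node_name;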

    12. Modern kernels usually have enough entropy; however, with TLS-heavy workloads it can become a problem.
      Check the entropy status of both clusters by running the following in bash (on each node):
      cat /proc/sys/kernel/random/entropy_avail
      If it returns something less than 300, you are on the edge of a problem.

    13. Identify spin lock problem symptoms on both with: perf top

    14. Check and compare Spread retransmits with this query:
      SELECT a.node_name, max(a.retrans) as max_retrans, max(a.time_interval) as max_interval, max(a.packet_count) as max_packet_count,
      max((a.retrans / (a.time_interval / '00:00:01'::interval))::numeric(18,2)) AS max_retrans_per_second
      , count(1)
      FROM (
      SELECT (dc_spread_monitor."time")::timestamp AS "time", dc_spread_monitor.node_name,
      (dc_spread_monitor.retrans - lag(dc_spread_monitor.retrans, 1, NULL::int) OVER
      (PARTITION BY dc_spread_monitor.node_name ORDER BY (dc_spread_monitor."time")::timestamp)) AS retrans,
      (((dc_spread_monitor."time")::timestamp - lag((dc_spread_monitor."time")::timestamp, 1, NULL::timestamp) OVER
      (PARTITION BY dc_spread_monitor.node_name ORDER BY (dc_spread_monitor."time")::timestamp))) AS time_interval,
      (dc_spread_monitor.packet_sent - lag(dc_spread_monitor.packet_sent, 1, NULL::int) OVER
      (PARTITION BY dc_spread_monitor.node_name ORDER BY (dc_spread_monitor."time")::timestamp)) AS packet_count
      FROM v_internal.dc_spread_monitor
      ) a
      GROUP BY 1
      ORDER BY 5 DESC
      limit 10;

    15. If you are using NVMe, check whether noatime is enabled on each of the NVMe mount points on both clusters.

    16. Any difference in Jumbo Frames enablement on ALL nodes of both clusters?
      ifconfig | grep mtu

    17. Relevant ONLY for Eon mode: query performance can improve by ~10-20% after running update_storage_catalog.
      Check whether update_storage_catalog is required on one of the clusters.
      Run the following query to determine which projections have invalid bundle metadata in the database catalog:
      SELECT node_name, projection_name, total_ros_count, ros_without_bundle_info_count
      FROM v_monitor.storage_bundle_info_statistics where ros_without_bundle_info_count > 0
      ORDER BY projection_name, node_name;

    Only for the projections returned by the query above, run DO_TM_TASK.
    The following will write the bundle metadata for all projections of table my_schema.my_table:
    SELECT DO_TM_TASK ('update_storage_catalog', 'my_schema.my_table');

    18. More RAM may defer spill decisions, ironically increasing latency if memory thresholds are hit unpredictably.
      So if one of the clusters has more RAM, a query on that cluster can take longer before it fails to fit in memory and automatically retries.
      For example, if a hash table cannot fit in memory, the join operation must spill to disk, causing both the fact and dimension tables to spill. If you know in advance that a specific query will spill, you can add a hint to skip the in-memory hashing attempt:
      SELECT /*+add_vertica_options(EE,ENABLE_JOIN_SPILL)*/ …

    To find out which queries retried and why, check the DC_REQUESTS_ISSUED and DC_REQUESTS_RETRIED Data Collector tables.
    SELECT transaction_id, statement_id, request FROM dc_requests_issued WHERE is_retry ORDER BY time DESC LIMIT 1;
    And:
    SELECT transaction_id, statement_id, request, reason_for_retry FROM dc_requests_retried ORDER BY time DESC LIMIT 1;

  • Moshe, thank you very much for such a comprehensive reply, I truly appreciate the effort!!!

    While I will certainly be going over the entire list, I wanted to clarify a few things:

    1. These are lab machines built from the same image (both OS and Vertica) so they are as identical as is humanly possible.
    2. I've already run the vnetperf/vcpuperf/vioperf tools and the better cluster came out on top in every single case, roughly in proportion to its more powerful hardware. I've also used sysbench, and those results are very much aligned with the v...perf results.
    3. The difference in performance in my query is not minor or intermittent. It is very obvious and can be reproduced every single time I run the query. That is to say that I am not looking for a needle in a haystack, rather an elephant in the room.

    thanks again, I will be updating this thread if I find the solution!
