
Slower querying speed after 1 billion records

Hi, I have a table which now contains more than 1 billion records, equally distributed across a 3-node Vertica cluster.

More details:
1. The table holds 180 days of data and is partitioned by a date-only column
2. Data for the current day (about 15-20M records) is loaded continuously throughout the day in batches of 10k records using the COPY statement
3. No DELETE or UPDATE queries are run against the table
4. Total table size is now about 2.5 TB
5. All queries on the table filter on a single column (the one in the WHERE clause), which is part of the projection's ORDER BY clause (a rough sketch of the setup is below)
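To make the setup concrete, here is a minimal sketch of roughly what the table and projection look like; all names are placeholders and the real table has about 70 columns:

CREATE TABLE events (
    event_ts   TIMESTAMP NOT NULL,    -- source timestamp (yyyy-mm-dd hh:mm:ss)
    event_date DATE NOT NULL,         -- date part of event_ts, the partition key
    row_hash   VARCHAR(32) NOT NULL,  -- internally generated MD5; every query filters on this
    payload    VARCHAR(200)           -- stands in for the remaining columns
)
PARTITION BY event_date;

CREATE PROJECTION events_super AS
SELECT * FROM events
ORDER BY row_hash, event_date
SEGMENTED BY HASH(row_hash) ALL NODES;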

When the table had about 250M records in total, queries ran in under 200 ms every single time, but now the same queries take 2-3 seconds (sometimes 9-10 seconds!) most of the time.

When I pick the slow-running queries and re-run them with the PROFILE keyword, I don't see the slowness.

What could be the reason for such a change in performance? Is it because of the huge data volume?


Answers

  • VValdar (Vertica Employee)

    Hi Pradeep,

    I would challenge point 2: Vertica is very efficient at loading lots of data, and batches of 10k are a 20th-century best practice.
    Is it possible to show your DDL (relevant columns only + sanitized if needed) and the query that feels slow?
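    For instance (file path, load window, and table name purely hypothetical), one larger COPY per window produces far fewer small ROS containers for the Tuple Mover to merge than a stream of 10k-row loads:

    COPY events FROM '/data/events_20240313_1200.csv' DELIMITER ',' DIRECT;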

    But the general question is: are you using the partition column in your WHERE clause?
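    For example (same placeholder names as the sketch above), a hash-only filter has to consider every partition, while adding a range on the partition column lets Vertica prune most of them:

    -- hash-only filter: all ~180 daily partitions are candidates
    SELECT * FROM events
    WHERE row_hash = '0123456789abcdef0123456789abcdef';

    -- hash + partition column: only the matching days are scanned
    SELECT * FROM events
    WHERE row_hash = '0123456789abcdef0123456789abcdef'
      AND event_date BETWEEN '2024-03-01' AND '2024-03-13';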

  • Hi @VValdar,

    Thanks.

    The reason for loading in batches of 10k is that the upstream system doesn't generate all 15-20M rows at once; it generates data throughout the day, from 00:00 to 23:59.

    The table contains about 70 columns. One of them is a TIMESTAMP column in yyyy-mm-dd hh:mm:ss format; the date part of this column is what we use for the date-only column, and the table is partitioned on it (we also plan to use the same partition key to delete data older than 180 days).

    At any point in time the table contains the most recent 180 days of data.
    For example, as of today, 13 Mar 2024, it contains data all the way back to 15 Oct 2023.
    Each row has a VARCHAR(32) column containing an MD5 hash value which we generate internally, and all queries filter on this column. So the column in the WHERE clause is not the partition key, but another column that is in the projection's ORDER BY clause.
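    For the retention part, the idea is to drop whole partitions by their date key, along the lines of the following (table name is a placeholder, using Vertica's DROP_PARTITIONS meta-function):

    -- drop the day that just fell out of the 180-day window
    SELECT DROP_PARTITIONS('events', '2023-10-14', '2023-10-14');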

  • VValdar (Vertica Employee)

    OK, the first point makes sense.
    The partition column also seems good.

    The issue when not using the partition column is that the database creates at least one ROS container per partition, and if the partition column isn't in the WHERE clause it has to open all of those ROS containers to apply the filter.

    Also, is your MD5 value the first column of the projection's ORDER BY? It helps if it's your only criterion.
    And finally, if your query for one hash value returns some rows (but not 100k), is this column used as the segmentation criterion? That lets the query run on one node instead of the whole cluster, which increases throughput.
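    A quick way to check both at once (table name assumed) is to dump the DDL with EXPORT_OBJECTS and look at the projection's ORDER BY and SEGMENTED BY clauses:

    SELECT EXPORT_OBJECTS('', 'events');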

  • edited March 14

    Hi

    The column containing the MD5 value is the first column in the ORDER BY clause, and it is also part of the SEGMENTED BY clause.

    Regarding the number of ROS containers: when I run the query below

    SELECT COUNT(DISTINCT ros_id) AS NumROS, node_name
    FROM partitions
    WHERE projection_name ILIKE '%table_name%'
    GROUP BY node_name
    ORDER BY node_name;

    I see many ROS containers.

    If I am not using the partition column in the WHERE clause, does Vertica have to open all of those ROS containers? Is that what could be causing the delays now? How can I optimize this to handle 3-4 billion records?
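    For reference, a way to check whether any partitions were pruned for the slow statement might be the following (assuming the PARTITIONS_ELIMINATED event type in v_monitor.query_events is the right signal here):

    SELECT event_type, event_description
    FROM v_monitor.query_events
    WHERE transaction_id=49539595902267610 AND statement_id=193
      AND event_type = 'PARTITIONS_ELIMINATED';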

    If I query the v_monitor.query_plan_profiles table with the transaction_id and statement_id, I see that the "Storage Access" part of the statement takes up all the time:

    SELECT path_id, path_line::VARCHAR(68), running_time FROM v_monitor.query_plan_profiles
    WHERE transaction_id=49539595902267610 AND statement_id=193 ORDER BY path_id, path_line_index;

    The same is true when I query the execution_engine_profiles table:

    SELECT
        operator_name,
        path_id,
        SUM(counter_value)
    FROM
        execution_engine_profiles
    WHERE
       transaction_id=49539595902267610 AND statement_id=193
    AND counter_name ILIKE 'execution%'
    GROUP BY
        operator_name,
        path_id
    ORDER BY
        3 DESC LIMIT 20;
    
    
  • Sergey_Cherepan_1 (edited March 22)

    This is how I understand your table:
    1) You have a column of type DATE and the table is partitioned on it (I assume a standard PARTITION BY ... GROUP BY clause), and you are using it to drop data older than 180 days.
    2) You have a unique MD5 hash generated for each row, stored in a VARCHAR(32) column; I assume your MD5 hash has constant length and fills all 32 characters of the column.
    3) The table has the MD5 column first in ORDER BY, and the MD5 column is part of the segmentation key.
    4) Every query fetches rows by MD5 value.

    First, there is no point in adding more columns to the ORDER BY after a unique column; they will not change the order of the data. Since the unique MD5 column is first in your ORDER BY, the rest of the columns there are just decoration in your table syntax.

    The same is pretty much true for the segmentation key: if it includes the unique MD5 column, the remaining columns are unnecessary.

    If every query includes an MD5 value, a very straightforward way to improve performance and make it more stable is to add an MD5_INT column of INT datatype that holds the MD5 value as an INT. Of course, an INT is 8 bytes and can hold only 8 bytes of your MD5 (for example, just the first 8 bytes).

    Another approach to creating the MD5_INT value is to take HASH(MD5), which will generate a practically unique 8-byte INT value.

    You will need to change your table's ORDER BY and place MD5_INT first. It does not matter what you place after MD5_INT, since it is "practically" unique.

    The segmentation key can remain the same as you have now.

    Your queries should then have two conditions: in addition to the existing condition on MD5, you add a condition on MD5_INT. You can write a SQL function that extracts a substring from the MD5 and converts it to an INT, and use it in your SQL to compute the MD5_INT value.

    I believe MD5_INT alone would be enough to use in queries because it is sufficiently unique, but for peace of mind you can keep the existing condition on MD5 in your queries.

    The idea here is that querying by INT is far faster than querying by a long VARCHAR string. As a bonus, you will get far more stable execution times.
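    A minimal sketch of the whole idea, using the HASH() variant (table, column, and projection names are placeholders, not from this thread, and populating the column via a derived DEFAULT is my assumption):

    -- 1. Add the INT column, derived from the existing MD5 string.
    ALTER TABLE events ADD COLUMN md5_int INT DEFAULT HASH(row_hash);

    -- 2. Create a projection ordered by the INT column first;
    --    segmentation stays on the original hash column.
    CREATE PROJECTION events_md5int AS
    SELECT * FROM events
    ORDER BY md5_int
    SEGMENTED BY HASH(row_hash) ALL NODES;

    -- 3. Queries filter on the INT first; keeping the original MD5
    --    predicate guards against hash collisions.
    SELECT * FROM events
    WHERE md5_int  = HASH('0123456789abcdef0123456789abcdef')
      AND row_hash = '0123456789abcdef0123456789abcdef';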
