Help optimizing query speed

skunkwerk · May 2016

I am running Vertica 7.2 on Ubuntu with 16 GB of RAM and 8 cores.

I have loaded 400 million rows of data, 69 columns wide into a single table.

I have partitioned the table by fiscal quarter.

The types of queries I am running are like:

SELECT AVG(t1.age), t1.gender
FROM
  (SELECT
     MIN(age) AS age,
     MIN(gender) AS gender
   FROM analytics.data
   WHERE country = 'JPN' AND dt BETWEEN '2016-01-01' AND '2016-02-15'
   GROUP BY customer_id) t1
GROUP BY t1.gender;

The speed is ok, but no faster than Teradata.

I would like to create a partition for every day (about 5,000 days in total), but Vertica says the max is 1,024 partitions, and doesn't recommend more than 10-20.

I would like to create a few projections, but my queries never filter on the same columns. It's always a different combination (e.g. country, product type, channel, etc.).

What can I do to improve the speed?

thanks

FiliN · May 2016

There is no partition pruning in Vertica. Partitions in Vertica are useful only for purging or moving old data.

The only way to optimize your query is to add the right sort order and segmentation to the projection. Do not forget about encoding either.

Sharon_Cutter · May 2016

FiliN wrote:
There is no partition pruning in Vertica.

Vertica does do partition pruning (elimination).

Sample Script:

drop table test cascade;

create table test (c1 integer);
insert /*+direct*/ into test values (1);
insert /*+direct*/ into test values (2);
insert /*+direct*/ into test values (3);
insert /*+direct*/ into test values (4);
insert /*+direct*/ into test values (5);
commit;

select * from test where c1=3;

select transaction_id, statement_id, event_category, event_type, event_details from query_events where transaction_id=current_trans_id();


Output:

DROP TABLE
CREATE TABLE
 OUTPUT 
--------
      1
(1 row)

 OUTPUT 
--------
      1
(1 row)

 OUTPUT 
--------
      1
(1 row)

 OUTPUT 
--------
      1
(1 row)

 OUTPUT 
--------
      1
(1 row)

COMMIT
 c1 
----
  3
(1 row)

  transaction_id   | statement_id | event_category |      event_type       |                         event_details                         
-------------------+--------------+----------------+-----------------------+---------------------------------------------------------------
 45035996273963554 |            1 | OPTIMIZATION   | NO HISTOGRAM          | No histogram for public.test.c1
 45035996273963554 |            1 | EXECUTION      | PARTITIONS_ELIMINATED | Using only 1 stores out of 5 for projection public.test_super
(2 rows)

Sharon_Cutter · May 2016

My example isn't actually partitioned, but it shows that ROS containers are pruned based on predicate values, regardless of the partition expression.

FiliN · May 2016

Elimination is not pruning. It means that Vertica doesn't scan ROS containers that don't contain data satisfying the predicate. But a ROS container can contain more than one partition. Elimination is close to classic pruning in traditional DBs, but it's not the same thing.

Sharon_Cutter · May 2016

And not scanning any ROS containers that don't satisfy the predicates is partition pruning. A ROS may contain more than one partition, but if you forcefully combine partitions into a single ROS container, then that's a tradeoff you're making - that pruning/elimination won't be as effective.
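
To see the same effect on a table that actually is partitioned, something along these lines could be tried. This is only a sketch: the table name part_test, its columns, and the monthly partition expression are made up for illustration and are not from the thread. The second query reads the PARTITIONS system table to show which partition keys ended up in which ROS containers.

```sql
-- Hypothetical demo table, partitioned by month (names are illustrative).
DROP TABLE IF EXISTS part_test CASCADE;

CREATE TABLE part_test (dt DATE NOT NULL, c1 INTEGER)
PARTITION BY EXTRACT(YEAR FROM dt) * 100 + EXTRACT(MONTH FROM dt);

-- One direct load per month, so each row lands in its own ROS container.
INSERT /*+direct*/ INTO part_test VALUES ('2016-01-15', 1);
INSERT /*+direct*/ INTO part_test VALUES ('2016-02-15', 2);
INSERT /*+direct*/ INTO part_test VALUES ('2016-03-15', 3);
COMMIT;

-- Filter on the partitioning column.
SELECT * FROM part_test WHERE dt BETWEEN '2016-02-01' AND '2016-02-29';

-- Check whether a PARTITIONS_ELIMINATED event was recorded for that query.
SELECT event_category, event_type, event_details
FROM query_events
WHERE transaction_id = current_trans_id();

-- See how partition keys map onto ROS containers.
SELECT partition_key, ros_id, ros_row_count
FROM partitions
WHERE projection_name LIKE 'part_test%';
```

If several partition keys show up under one ros_id, elimination on that container can only be as selective as the widest range it holds, which is the tradeoff discussed above.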

skunkwerk · May 2016

Thank you Filin, Sharon.

With regard to projections, I still don't know how to create a projection if the columns in my WHERE clause are never the same from query to query. Do I just create 100 projections for the most common ones?

thanks,

imran

FiliN · May 2016

How many nodes do you have in your cluster?

What are the most resource intensive queries based on this table?

Segmentation depends on the join conditions (and so does the sort order, actually).

In your case you should try to order projection like this:

order by gender, country, customer (from low to high cardinality)

And it's desirable to use encoding on your predicate fields.
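
A minimal sketch of what such a projection might look like, assuming the table has columns named gender, country, customer_id, dt and age (taken from the query in the question; the real schema has 69 columns). The projection name data_by_gender_country, the choice of RLE for the low-cardinality columns, and segmenting on customer_id are assumptions for illustration, not a tested recommendation.

```sql
-- Hypothetical projection: sorted from low to high cardinality,
-- with RLE encoding on the low-cardinality predicate columns.
CREATE PROJECTION analytics.data_by_gender_country
(
    gender   ENCODING RLE,
    country  ENCODING RLE,
    customer_id,
    dt,
    age
)
AS
SELECT gender, country, customer_id, dt, age
FROM analytics.data
ORDER BY gender, country, customer_id
SEGMENTED BY HASH(customer_id) ALL NODES;

-- Populate the new projection from existing data (runs in the background).
SELECT START_REFRESH();

-- Refresh optimizer statistics so the planner can cost the new projection.
SELECT ANALYZE_STATISTICS('analytics.data');
```

A projection like this only serves queries that touch the columns it contains, so the column list would need to cover whatever the heaviest queries select and filter on.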

skunkwerk · June 2016

It's a single-node dev cluster.

There is no joining, as all the data is in a single table.

The slowest part of the query is the GROUP BY customer_id.

Is the only way to speed that up to add it to the projection?

thanks!
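
One way to check how the GROUP BY is being executed is to look at the plan. The sketch below runs EXPLAIN on the query from the first post, with aliases added to the subquery so the outer column references resolve. In the plan output, a GROUPBY HASH step generally means the rows are being aggregated through a hash table, while GROUPBY PIPELINED means the projection's sort order let Vertica aggregate rows as they stream by. Whether a projection sorted on country, customer_id would switch this particular query to the pipelined form is an assumption that would need to be verified against the actual plan.

```sql
-- Inspect the plan and look for GROUPBY HASH vs. GROUPBY PIPELINED.
EXPLAIN
SELECT AVG(t1.age), t1.gender
FROM (SELECT MIN(age) AS age,
             MIN(gender) AS gender
      FROM analytics.data
      WHERE country = 'JPN'
        AND dt BETWEEN '2016-01-01' AND '2016-02-15'
      GROUP BY customer_id) t1
GROUP BY t1.gender;
```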
