Help optimizing query speed

I am running Vertica 7.2 on Ubuntu with 16 GB of RAM and 8 cores.

 

I have loaded 400 million rows of data, 69 columns wide into a single table.

 

I have partitioned the table by fiscal quarter.

 

The types of queries I am running are like:

 

SELECT avg(t1.age), t1.gender

FROM

    (SELECT

    MIN(age),

    MIN(gender)

    FROM analytics.data

    WHERE country='JPN' and dt BETWEEN '2016-01-01' AND '2016-02-15'

    GROUP BY customer_id) t1

GROUP BY gender;

 

The speed is ok, but no faster than Teradata.

 

I would like to create a partition for every day (about 5,000 days in total), but Vertica says the max is 1,024 partitions, and doesn't recommend more than 10-20.

 

I would like to create a few projections, but my queries never use the same columns to filter on - it's always a different combination of columns (ie country, product type, channel, etc).

 

What can I do to improve the speed?

 

thanks

Comments

  • There is no partition pruning in Vertica. Partitions in Vertica are useful only for pruging or moving an old data.

    The only way to optimize your query is to add the right sort and segmetation in projection. Do not forget about encoding also.

     


  • FiliN wrote:

    There is no partition pruning in Vertica.

     



    Vertica does do partition pruning (elimination).  

     

    Sample Script:

    drop table test cascade;

    create table test (c1 integer);
    insert /*+direct*/ into test values (1);
    insert /*+direct*/ into test values (2);
    insert /*+direct*/ into test values (3);
    insert /*+direct*/ into test values (4);
    insert /*+direct*/ into test values (5);
    commit;

    select * from test where c1=3;

    select transaction_id, statement_id, event_category, event_type, event_details from query_events where transaction_id=current_trans_id();


    Output:

    DROP TABLE
    CREATE TABLE
    OUTPUT
    --------
    1
    (1 row)

    OUTPUT
    --------
    1
    (1 row)

    OUTPUT
    --------
    1
    (1 row)

    OUTPUT
    --------
    1
    (1 row)

    OUTPUT
    --------
    1
    (1 row)

    COMMIT
    c1
    ----
    3
    (1 row)

    transaction_id | statement_id | event_category | event_type | event_details
    -------------------+--------------+----------------+-----------------------+---------------------------------------------------------------
    45035996273963554 | 1 | OPTIMIZATION | NO HISTOGRAM | No histogram for public.test.c1
    45035996273963554 | 1 | EXECUTION | PARTITIONS_ELIMINATED | Using only 1 stores out of 5 for projection public.test_super
    (2 rows)

     

  • My example isn't actually partitioned, but it shows that ROS containers are pruned based on predicate values, regardless of the partition expression.

     

     

  • Elimination is not a pruning. It means that Vertica doesn't scan ROS containers that doesn't comtain data that satisfeis predicate. But ROS could contain more than one  partition. Elimination is close to classic pruning in traditional DBs, but its not.

  • And not scanning any ROS containers that don't satisfy the predicates is partition pruning.  A ROS may contain more than one partition, but if you forcefully combine partitions into a single ROS container, then that's a tradeoff you're making - that pruning/elimination won't be as effective.

     

     

  • Thank you Filin, Sharon.

     

    With regards to projections, I still don't know how to create a projection if the columns in my WHERE clause are never the same query-to-query.  Do I just create 100 projections for the most common ones?

     

    thanks,

    imran

  • How many nodes do you have in your cluster?

    What are the most resource intensive queries based on this table?

    segmentation depends on join conditions (actualy order too)

     

    In your case you should try to order projection like this:

     

    order by gender, country, customer  (from low to high cardinality)

    And it's desirable to use encoding on you predicate fields

     

     

  • It's a single-node dev cluster.

    There is no joining, as all the data is in a single table.

     

    The slowest part of the query is the GROUP BY customer_id.

    Is the only way to speed that up to add it to the projection?

     

    thanks!

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file