Help optimizing query speed
I am running Vertica 7.2 on Ubuntu with 16 GB of RAM and 8 cores.
I have loaded 400 million rows of data, 69 columns wide into a single table.
I have partitioned the table by fiscal quarter.
The types of queries I am running are like:
SELECT avg(t1.age), t1.gender
FROM
(SELECT
MIN(age),
MIN(gender)
FROM analytics.data
WHERE country='JPN' and dt BETWEEN '2016-01-01' AND '2016-02-15'
GROUP BY customer_id) t1
GROUP BY gender;
The speed is ok, but no faster than Teradata.
I would like to create a partition for every day (about 5,000 days in total), but Vertica says the max is 1,024 partitions, and doesn't recommend more than 10-20.
I would like to create a few projections, but my queries never use the same columns to filter on - it's always a different combination of columns (ie country, product type, channel, etc).
What can I do to improve the speed?
thanks
Comments
There is no partition pruning in Vertica. Partitions in Vertica are useful only for pruging or moving an old data.
The only way to optimize your query is to add the right sort and segmetation in projection. Do not forget about encoding also.
Vertica does do partition pruning (elimination).
My example isn't actually partitioned, but it shows that ROS containers are pruned based on predicate values, regardless of the partition expression.
Elimination is not a pruning. It means that Vertica doesn't scan ROS containers that doesn't comtain data that satisfeis predicate. But ROS could contain more than one partition. Elimination is close to classic pruning in traditional DBs, but its not.
And not scanning any ROS containers that don't satisfy the predicates is partition pruning. A ROS may contain more than one partition, but if you forcefully combine partitions into a single ROS container, then that's a tradeoff you're making - that pruning/elimination won't be as effective.
Thank you Filin, Sharon.
With regards to projections, I still don't know how to create a projection if the columns in my WHERE clause are never the same query-to-query. Do I just create 100 projections for the most common ones?
thanks,
imran
How many nodes do you have in your cluster?
What are the most resource intensive queries based on this table?
segmentation depends on join conditions (actualy order too)
In your case you should try to order projection like this:
order by gender, country, customer (from low to high cardinality)
And it's desirable to use encoding on you predicate fields
It's a single-node dev cluster.
There is no joining, as all the data is in a single table.
The slowest part of the query is the GROUP BY customer_id.
Is the only way to speed that up to add it to the projection?
thanks!