Partitioning tables in Vertica?
Hello,
I need to partition one of our tables. The table currently has a few million rows and is expected to grow quickly. The main reason for partitioning is to improve query performance. Almost all queries on this table use the following two fields:
1. Primary key field - MessageId
2. Timestamp field - Messagetimestamp
I have worked with other databases and have created time-based partitions there. I need some input from the experts on this forum:
a. Can I use the primary key field for partitioning tables in Vertica?
b. Should I use some kind of function (e.g. MOD) on MessageId to reduce the number of partition keys? I have read about too many partitions causing ROS container issues.
thanks
jackyjoy
Answers
@jackyjoy123
Before we go into partitioning tables in Vertica, can you please provide some additional information on your Vertica cluster and the query-performance problem you are trying to solve?
It is true that partition pruning will improve query performance, but in Vertica I would use table partitioning only as a last resort: too many partitions will result in ROS pushback, and the recommended number of partitions is around ~40. Where partitioning really shines is data management (archiving or dropping old data by partition).
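If you do partition, you can keep an eye on how many partition keys and ROS containers each projection carries with a query against the PARTITIONS system table, something like the following sketch (the table name public.messages is an assumption from your post):

    -- Count distinct partition keys and ROS containers per projection
    SELECT projection_name,
           COUNT(DISTINCT partition_key) AS partition_keys,
           COUNT(ros_id)                 AS ros_containers
    FROM partitions
    WHERE table_schema = 'public'
      AND projection_name ILIKE 'messages%'
    GROUP BY projection_name;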
For timestamp-based data, I have found hierarchical partitioning to be the most effective approach.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Partitions/HierarchicalPartitioning.htm
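As a rough sketch, assuming a table called public.messages with the two columns from your question (names and types are illustrative), hierarchical partitioning by day would look like this:

    -- Partition by day; keep daily containers for the 2 most recent months,
    -- monthly containers for 2 years, and yearly containers beyond that
    CREATE TABLE public.messages (
        MessageId        INT NOT NULL,
        Messagetimestamp TIMESTAMP NOT NULL,
        payload          VARCHAR(1000)
    )
    PARTITION BY Messagetimestamp::DATE
    GROUP BY CALENDAR_HIERARCHY_DAY(Messagetimestamp::DATE, 2, 2);

Because older partitions get merged into monthly and yearly groups, the ROS container count stays bounded even with years of data.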
Normally you don't want to use primary key columns for partitioning; they are better suited for segmentation.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Partitions/PartitioningAndSegmentingData.htm
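As an illustration, segmenting by the primary key is done in the projection definition. A sketch only (the projection name and column list are assumptions):

    -- Spread rows evenly across the cluster by hashing the primary key
    CREATE PROJECTION public.messages_by_id
    AS SELECT MessageId, Messagetimestamp, payload
       FROM public.messages
       ORDER BY MessageId, Messagetimestamp
       SEGMENTED BY HASH(MessageId) ALL NODES;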
Typically, 90% of query-performance SLAs in Vertica can be met by designing proper projections (segmentation and ordering) on the table.
Some of the following links will be helpful.
You can use DBD to recommend comprehensive and/or query-specific projections (see the sketch after these links).
https://www.vertica.com/blog/query-tuning-vertica-dos-donts/
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/OptimizingQueryPerformance.htm?TocPath=Analyzing Data|Query Optimization|_____0
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/InitialProcessForImprovingQueryPerformance.htm?TocPath=Analyzing Data|Query Optimization|_____1
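If you want to drive DBD from SQL rather than admintools or Management Console, the flow is roughly as follows. This is only a sketch: the design name, sample query, and output paths are made up, and the exact argument lists of these meta-functions vary a little between versions, so check the Database Designer meta-function reference for your release.

    -- Create a design workspace, then register the table and a representative query
    SELECT DESIGNER_CREATE_DESIGN('messages_design');
    SELECT DESIGNER_ADD_DESIGN_TABLES('messages_design', 'public.messages');
    SELECT DESIGNER_ADD_DESIGN_QUERY('messages_design',
           'SELECT * FROM public.messages WHERE MessageId = 12345;', 1);
    -- Optimize for query performance, then generate and deploy the projections
    SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('messages_design', 'QUERY');
    SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('messages_design',
           '/tmp/messages_design.sql', '/tmp/messages_deploy.sql');
    SELECT DESIGNER_DROP_DESIGN('messages_design');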
Furthermore, Vertica performs Container pruning even on non-partitioned tables when appropriate.
https://forum.vertica.com/discussion/240698/storage-container-optimization-for-tables-not-partitioned
Hope this helps.
@jackyjoy123
Can I use primary key field for partitioning tables in Vertica?
For instance, if you have a table Orders with the field OrderDate, you would most likely partition based on the month and year of OrderDate. When records age out and are no longer relevant, you can move those partitions off to an archive table or database so they are no longer processed.
Partitioning will work with pretty much any field, but in order for it to work WELL, the field(s) you partition on should be used in most, if not all, of your queries. If your queries don't filter on the partition keys, you essentially get an expensive scan across every partition.
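Aging data out then becomes a single meta-function call per range. A sketch, assuming the table is partitioned on the date part of a timestamp and that an archive table public.messages_archive with an identical column definition already exists (both names are made up):

    -- Move all 2018 partitions to the archive table
    SELECT MOVE_PARTITIONS_TO_TABLE('public.messages', '2018-01-01', '2018-12-31',
                                    'public.messages_archive');
    -- Or drop partitions that are no longer needed at all
    SELECT DROP_PARTITIONS('public.messages', '2017-01-01', '2017-12-31');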
You can also check the link below, Optimizing Vertica Data with Partitions; it has a very good explanation of partitioning, with an example.
https://www.vertica.com/blog/optimizing-vertica-data-with-partitions/
You can easily combine partitioning by an element of time (at a reasonable granularity), even if that element of time is a constituent of the primary key, with segmenting by the primary key itself.
Here is a typical Slowly Changing Dimension (SCD) design that I use:
SCD tables have two ways of being uniquely identified: by a surrogate key, or by the natural (business) key combined with the validity time range.
If you don't use the surrogate key approach, nothing speaks against a design like this:
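A sketch of what such a design might look like (table and column names are made up):

    CREATE TABLE public.dim_customer (
        customer_nk    INT  NOT NULL,   -- natural / business key
        valid_from     DATE NOT NULL,   -- start of the validity interval
        valid_to       DATE NOT NULL,   -- end of the validity interval
        customer_name  VARCHAR(100),
        PRIMARY KEY (customer_nk, valid_from)
    )
    ORDER BY customer_nk, valid_from
    SEGMENTED BY HASH(customer_nk) ALL NODES
    PARTITION BY YEAR(valid_from) * 100 + MONTH(valid_from);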
And remember: if you really segment by the primary key, any query filtering on a single primary-key value will prune all nodes except the one where that PK value resides from participating in the query. In a 12-node cluster, eleven nodes will twiddle their thumbs while one does all the work.