
Partitioning tables in Vertica?

jackyjoy123 Vertica Customer
edited January 2021 in General Discussion

Hello,

I need to partition one of our tables. It currently has a few million rows and is expected to grow fast. The main reason for partitioning is to improve query performance. Almost all queries on this table use the following two fields:

1. Primary key field - MessageId
2. Timestamp field - Messagetimestamp

I have worked with other databases and have created time-based partitions there. I need some input from the experts on this forum:

a. Can I use the primary key field for partitioning tables in Vertica?
b. Should I use some kind of function (e.g. MOD) on MessageId to reduce the number of partition keys? I have read about too many partitions causing ROS container issues.
thanks
jackyjoy

Answers

  • Sudhakar_B Vertica Customer ✭✭
    edited January 2021

    @jackyjoy123
    Before we go into partitioning tables in Vertica, can you please provide some additional information on your Vertica cluster and the query-performance problem you are trying to solve?
    It is true that partition pruning will improve query performance, but in Vertica I would use table partitioning only as a last resort. Too many partitions will result in ROS pushback; the recommended number of partitions is around ~40. Partitioning is most helpful for data management.
    For timestamp-based data, I have found hierarchical partitioning most effective.
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Partitions/HierarchicalPartitioning.htm
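    As a sketch of what that could look like for your table (table and column names assumed from your post; adjust to your schema), you partition by day but let Vertica group older partitions by month and year:

    ALTER TABLE public.messages
    PARTITION BY messagetimestamp::DATE
    GROUP BY CALENDAR_HIERARCHY_DAY(messagetimestamp::DATE, 2, 2)
    REORGANIZE;

    Here the two trailing arguments keep the most recent 2 months as daily partitions and group by month for the most recent 2 years; anything older is grouped by year, which keeps the ROS container count bounded as data ages.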

    Normally you don't want to use primary-key columns for partitioning, but rather for segmentation.
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Partitions/PartitioningAndSegmentingData.htm
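    To make that concrete for your table (a minimal sketch only; column names and types are assumed from your post): segment by the primary key for even data distribution across nodes, and partition by a coarse time expression:

    CREATE TABLE public.messages (
      messageid        INT NOT NULL PRIMARY KEY,
      messagetimestamp TIMESTAMP NOT NULL,
      payload          VARCHAR(1000)
    )
    SEGMENTED BY HASH(messageid) ALL NODES
    PARTITION BY (YEAR(messagetimestamp)*100 + MONTH(messagetimestamp));

    Monthly partitions keep the partition count near the ~40 guideline for roughly three years of data.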

    1. You said the table has a few million rows and is expected to grow fast. 10 million, 50 million, 500 million? What is the expected growth rate?
    2. How many nodes are in your cluster? What is the approximate size of each node (# of cores and RAM)?
    3. What is the version of Vertica you are using?

    Typically, 90% of query-performance SLAs in Vertica can be met by designing proper projections (segmentation and ordering) on the table.
    Some of the following links will be helpful.
    You can use the DBD to recommend comprehensive and/or query-specific projections.
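    As a rough sketch of the function-based DBD workflow (the design name, table name, and output paths here are placeholders; check the DESIGNER_* function reference for your Vertica version):

    SELECT DESIGNER_CREATE_DESIGN('msg_design');
    SELECT DESIGNER_SET_DESIGN_TYPE('msg_design', 'comprehensive');
    SELECT DESIGNER_ADD_DESIGN_TABLES('msg_design', 'public.messages');
    SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY(
        'msg_design', '/tmp/msg_design.sql', '/tmp/msg_deploy.sql');
    SELECT DESIGNER_DROP_DESIGN('msg_design');

    You can also run the DBD interactively through admintools or the Management Console instead.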
    https://www.vertica.com/blog/query-tuning-vertica-dos-donts/
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/OptimizingQueryPerformance.htm?TocPath=Analyzing Data|Query Optimization|_____0
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/InitialProcessForImprovingQueryPerformance.htm?TocPath=Analyzing Data|Query Optimization|_____1

    Furthermore, Vertica performs Container pruning even on non-partitioned tables when appropriate.
    https://forum.vertica.com/discussion/240698/storage-container-optimization-for-tables-not-partitioned
    Hope this helps.

  • Nimmi_gupta Employee

    @jackyjoy123
    Can I use primary key field for partitioning tables in Vertica?

    You should choose a partitioning expression based on time (a date field).

    For instance, if you have a table Orders with the field OrderDate, you would most likely partition by the month and year of OrderDate. When records age out and are no longer relevant, you can move those partitions off to an archive table or database so they are no longer processed.
    Partitioning will work with pretty much any field, but for it to work WELL, the field(s) you partition on should be used in most, if not all, of your queries. If you don't include your partition keys, you will essentially get an expensive scan across multiple partitions.
    You can also check the link below, Optimizing Vertica Data with Partitions; it has a very good explanation of partitioning, with an example.
    https://www.vertica.com/blog/optimizing-vertica-data-with-partitions/
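    For instance, assuming the Orders table above is partitioned by month of OrderDate, a query that filters on the partition expression lets Vertica skip every other month's storage containers:

    -- Only the January 2021 partition needs to be scanned:
    SELECT COUNT(*)
    FROM orders
    WHERE orderdate >= '2021-01-01' AND orderdate < '2021-02-01';

    The same query without the OrderDate predicate would have to read all partitions.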

  • marcothesane Administrator

    You can easily combine partitioning by an element of time (at a reasonable granularity), even if that element of time is a constituent of the primary key, with segmenting by the primary key itself.

    Here is a typical Slowly Changing Dimension (SCD) design that I use:

    CREATE TABLE scd.d_cust_scd (
      cust_key     INT NOT NULL
        DEFAULT HASH(cust_id,cust_from_dt)
    , cust_id      INT NOT NULL
    , cust_from_dt DATE NOT NULL
    , cust_to_dt   DATE NOT NULL
    , cust_is_hist BOOLEAN NOT NULL
    , cust_ins_ts  TIMESTAMP NOT NULL
    , cust_upd_ts  TIMESTAMP NOT NULL
    , cust_fname   VARCHAR(30)
    , cust_lname   VARCHAR(30)
    , cust_phoneno VARCHAR(30)
    , cust_loy_lvl INT
    , cust_org_id  INT
    , CONSTRAINT d_cust_scd_pk PRIMARY KEY(cust_key)
    )
    SEGMENTED BY HASH(cust_key) ALL NODES
    PARTITION BY YEAR(cust_from_dt)*100+MONTH(cust_from_dt)
    ;
    

    SCD tables have two ways of being uniquely identified:

    • the natural/business identifier plus the valid-from date/timestamp (its constraint would be UNIQUE)
    • the single-column integer surrogate key (this one's constraint is then the PRIMARY KEY)

    If you don't use the surrogate-key approach, there is nothing wrong with a design like this:

    CREATE TABLE l_cust (
      cust_id      INT NOT NULL
    , cust_from_dt DATE NOT NULL
    , cust_to_dt   DATE NOT NULL
    , cust_is_hist BOOLEAN NOT NULL
    , cust_ins_ts  TIMESTAMP NOT NULL
    , cust_upd_ts  TIMESTAMP NOT NULL
    , cust_fname   VARCHAR(30)
    , cust_lname   VARCHAR(30)
    , cust_phoneno VARCHAR(30)
    , cust_loy_lvl INT
    , cust_org_id  INT
    , CONSTRAINT l_cust_pk PRIMARY KEY(cust_id,cust_from_dt)
    )
    SEGMENTED BY HASH(cust_id) ALL NODES
    PARTITION BY YEAR(cust_from_dt)*100+MONTH(cust_from_dt)
    ;
    

    And remember: if you really segment by the primary key, any query filtering on a single primary-key value will prune all nodes except the one where that PK value resides from participating in the query. In a 12-node cluster, eleven nodes will twiddle their thumbs while one does all the work.
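    To illustrate with the l_cust table above (the filter values are made up; you can confirm the behaviour with EXPLAIN):

    -- Single-value filter on the segmentation key: only the node
    -- owning HASH(1234)'s segment does the work; the rest are pruned.
    SELECT * FROM l_cust WHERE cust_id = 1234;

    -- No filter on the segmentation key: all nodes scan in parallel.
    SELECT COUNT(*) FROM l_cust WHERE cust_to_dt > CURRENT_DATE;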
