Specifying length of VARCHAR

edited December 2020 in General Discussion

In Vertica is there a benefit to frugally specifying length limits in VARCHAR columns? Does this allocate space, such that E.g. VARCHAR(5) for ZIP codes is better than VARCHAR? Conversely, is there a downside to specifying large values, e.g. VARCHAR(65535) when the string length of possible values might be larger than 80 but the upper limit not known?

By contrast, I understand that in Postgres there's no advantage to specifying the length https://dba.stackexchange.com/questions/20974/should-i-add-an-arbitrary-length-limit-to-varchar-columns

Answers

  • Sudhakar_BSudhakar_B ✭✭
    edited December 2020

    Prior to version 9.3, yes there is a performance penalty in Vertica for specifying "large" VARCHAR length. Primarily because, though the disk storage is used only the length of the data, Memory allocation happens for full length of the column, even if the data is much smaller.

    There are two way the performance penalty manifests.
    If the VARCHAR(500) (or any large value) is used in JOIN condition using HASH join. I have seen this happening in production systems.
    The HASH table will be unnecessarily large resulting OOM or DISK_SPILLING. Vertica 9.3 introduced JFMT Hint and JoinDefaultTupleFormat (session, database level) configuration parameter that mitigates this problem effectively. Though the default value of the parameter is still "fixed" (not variable).
    https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/AnalyzingData/Optimizations/VariableLengthDataJoin.htm?zoom_highlight=JoinDefaultTupleFormat

    The other way performance penalty manifest is apart from allocating larger memory foot print if Vertica has to transmit the result to the client, it will use the full length buffers. I have seen long varchar (2.5 Mil) where as the text data is is in the rage of 10-100K.
    This can be mitigated by casting/substring when selecting data from such columns.

    IMHO, also anytime if the column is used in segmentation, order by, group by, distinct etc, it will be efficient to specify reasonable/expected length for Varchar column.

  • Jim_KnicelyJim_Knicely Administrator
    edited December 2020

    A quick example shows what @Sudhakar_B is refereing to :smile:

    dbadmin=> CREATE TABLE t_varchar_1 (c VARCHAR(1));
    CREATE TABLE
    
    dbadmin=> CREATE TABLE t_varchar_65000 (c VARCHAR(65000));
    CREATE TABLE
    
    dbadmin=> INSERT INTO t_varchar_1 SELECT 'A';
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> INSERT INTO t_varchar_65000 SELECT 'A';
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> SELECT current_trans_id(), current_statement(), c FROM t_varchar_1 WHERE c = 'A';
     current_trans_id  | current_statement | c
    -------------------+-------------------+---
     45035996273991281 |                 3 | A
    (1 row)
    
    dbadmin=> SELECT peak_memory_kb FROM query_consumption WHERE transaction_id = 45035996273991281 AND statement_id = 3;
     peak_memory_kb
    ----------------
               5657
    (1 row)
    
    dbadmin=> SELECT current_trans_id(), current_statement(), c FROM t_varchar_65000 WHERE c = 'A';
     current_trans_id  | current_statement | c
    -------------------+-------------------+---
     45035996273991281 |                 5 | A
    (1 row)
    
    dbadmin=> SELECT peak_memory_kb FROM query_consumption WHERE transaction_id = 45035996273991281 AND statement_id = 5;
     peak_memory_kb
    ----------------
              46209
    (1 row)
    

    When I first started working with Vertica as a consultant back in 2011, a client had defined ALL of their character fields as VARCHAR(65000)! Needless to say, they were complaining about query performance. We "right-sized" their VARCHARs, and voila, performance went through the roof!

  • edited December 2020

    Thank you for this awesomely clear answer. Providing those example performance checks is also extremely helpful showing how to test before posting to the forum.

  • Jim_KnicelyJim_Knicely Administrator

    @Pieter_Sheth-Vo - That was a great question!

    Note that the documentation does hint that it's a bad idea to over size your VARCHARs by saying:

    Due to compression in Vertica, the cost of overestimating the length of these fields is incurred primarily at load time and during sorts.

    See: https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/CharacterDataTypes.htm

  • @Jim_Knicely ,
    Is there a downside to setting JoinDefaultTupleFormat to "variable" at DB level? Even in 10.x the default is "fixed".
    The reason I am asking is the parameter suggests should be changed only under Vertica support guidance.
    Furthermore JFMT hint works only with SYNTACTIC_JOIN hint at individual query level.
    So is it possible that setting this at DB level disable some other optimizer options?
    I have done some extensive testing and this seems to work well.
    TIA

  • marcothesanemarcothesane - Select Field - Administrator

    Well, @Sudhakar_B ,
    As a matter of fact, yes, there is a downside in doing so.
    setting JoinDefaultTupleFormat to "variable" will not resolve all issues with oversized column types: Loading , sorting will remain penalised at database level; and, what hurts much more: all SQL clients - even vsql - will over-allocate fetch buffers for the maximum possible lengths. And those maximum possible lengths will be multiplied by the rowset - the number of rows you fetch with each fetch call. You'll run out of memory quickly on the middle tier or client side - or both. This applies to BI tools, ETL tools, and your favourite SQL client, to name a few.

    And, if you use your judgment and size your columns wisely, with JoinDefaultTupleFormat set to "variable" , you're then slower than with fixed formats, where simple pointer arithmetics can let you navigate between tuples, whereas you always have to calculate at atomic level for navigating the memory with variable buffer lengths.

    My suggestion is to run a SELECT MAX(OCTET_LENGTH(RTRIM(s))) for all suspicious strings, and use the result for re-dimensioning your VARCHARs.

    Fix the problems as high up in the food chain as you can.

  • @marcothesane ,
    That you for detailed response. I agree with your overall assessment and guidance.
    While I understand the impact of using "wide" columns, in the industry and Vertica customers the following points come into play.
    1. Vertica customers who build data-lake in Vertica tend to lean towards ELT than ETL. At some customer's sites data from several different independent data sources (and databases) are continuously ingested into Vertica with minimal transformation. And then all the transformations (technical and business) happen within Vertica. In such situation customers want to maintain "high-fidelity" of data types and model to the respective source systems, over which we have no-control. Furthermore, the expectation is that Vertica data-lake/EDW to be "more forgiving for data types" than the respective source operational systems.
    2. In such environments data-ingest and DML are normally at individual table level with minimal to no joins. Though they are penalized for wide columns usage the penalty is acceptable and can be managed. As the data ingest team has control over these processes.
    3. Data consumers from Vertica, BI tools, R/Python platforms, VSQL or other clients typically run large, complex "analytic & exploratory" queries in Vertica. Such queries typically perform several inefficient joins (on wide varchar columns), many computation and aggregation with-in Vertica on very large data-set and return much smaller aggregated results to the client. This is why I was focusing more on the JOIN (specially HASH) in Vertica and the JoinDefaultTupleFormat.
    4. Though I have seen some clients extracting "large result set" from Vertica for analytic processing in other platform, I dissuade them from following that practice as Vertica is designed for IN-DB Analytic workload.
    5. I have done some extensive testing with this parameter at one customer site (version 9.3.1) and found very favorable results. However since this is new parameter in 9.3, I wanted to reach-out to see if these are any known specific downside for setting the value to "variable".
    As I can't share the customer test cases here, I found a repeatable use-case in Vertica VMware version 10.0.0 to demonstrate the effectiveness of the parameter JoinDefaultTupleFormat.

    [dbadmin@localhost ~]$ vsql
    Password: 
    Welcome to vsql, the Vertica Analytic Database interactive terminal.
    
    Type:  \h or \? for help with vsql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
    dbadmin=> select version(), node_address, sysdate() from nodes;
                   version               | node_address |          sysdate          
    -------------------------------------+--------------+---------------------------
     Vertica Analytic Database v10.0.0-0 | 127.0.0.1    | 2020-12-28 11:13:38.19846
    (1 row)
    
    dbadmin=> select parameter_name, default_value from configuration_parameters
    dbadmin-> where parameter_name ilike 'JoinDefault%';
         parameter_name     | default_value 
    ------------------------+---------------
     JoinDefaultTupleFormat | fixed
    (1 row)
    
    dbadmin=> alter session set JoinDefaultTupleFormat = 'fixed';
    ALTER SESSION
    dbadmin=> 
    dbadmin=> show session parameter JoinDefaultTupleFormat;
              name          | setting 
    ------------------------+---------
     JoinDefaultTupleFormat | fixed
    (1 row)
    
    dbadmin=> 
    dbadmin=> \timing
    Timing is on.
    dbadmin=> \o /dev/null
    dbadmin=> profile select * from  v_catalog.log_queries;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273989883 and statement_id=1;
    NOTICE 3557:  Initiator memory for query: [on pool sysquery: 53491 KB, minimum: 53491 KB]
    NOTICE 5077:  Total memory required by query: [2241306 KB]
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273989883 and statement_id=2;
    NOTICE 3557:  Initiator memory for query: [on pool sysquery: 425180 KB, minimum: 425180 KB]
    NOTICE 5077:  Total memory required by query: [**2485156 KB**]
    ERROR 3815:  Join inner did not fit in memory [(v_catalog.vs_audit_categories x v_internal.dc_requests_completed) using vs_audit_categories_p and previous join (PATH ID: 2)]
    dbadmin=> 
    dbadmin=> 
    dbadmin=> **alter session set JoinDefaultTupleFormat = 'variable';**
    Time: First fetch (0 rows): 7.617 ms. All rows formatted: 7.650 ms
    dbadmin=> 
    dbadmin=> show session parameter JoinDefaultTupleFormat;
    Time: First fetch (1 row): 0.472 ms. All rows formatted: 0.490 ms
    dbadmin=> 
    dbadmin=> \o /dev/null
    dbadmin=> profile select * from  v_catalog.log_queries;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273989891 and statement_id=1;
    NOTICE 3557:  Initiator memory for query: [on pool sysquery: 54563 KB, minimum: 54563 KB]
    NOTICE 5077:  Total memory required by query: [**114479 KB**]
    Time: First fetch (88 rows): **2229.579** ms. All rows formatted: 2242.888 ms
    dbadmin=> 
    dbadmin=> 
    
  • edited December 2020

    @Sudhakar_B said:
    @marcothesane ,
    That you for detailed response. I agree with your overall assessment and guidance.
    While I understand the impact of using "wide" columns, in the industry and Vertica customers the following points come into play.
    1. Vertica customers who build data-lake in Vertica tend to lean towards ELT than ETL. At some customer's sites data from several different independent data sources (and databases) are continuously ingested into Vertica with minimal transformation. And then all the transformations (technical and business) happen within Vertica. In such situation customers want to maintain "high-fidelity" of data types and model to the respective source systems, over which we have no-control. Furthermore, the expectation is that Vertica data-lake/EDW to be "more forgiving for data types" than the respective source operational systems.
    2. In such environments data-ingest and DML are normally at individual table level with minimal to no joins. Though they are penalized for wide columns usage the penalty is acceptable and can be managed. As the data ingest team has control over these processes.
    3. Data consumers from Vertica, BI tools, R/Python platforms, VSQL or other clients typically run large, complex "analytic & exploratory" queries in Vertica. Such queries typically perform several inefficient joins (on wide varchar columns), many computation and aggregation with-in Vertica on very large data-set and return much smaller aggregated results to the client. This is why I was focusing more on the JOIN (specially HASH) in Vertica and the JoinDefaultTupleFormat.
    4. Though I have seen some clients extracting "large result set" from Vertica for analytic processing in other platform, I dissuade them from following that practice as Vertica is designed for IN-DB Analytic workload.
    5. I have done some extensive testing with this parameter at one customer site (version 9.3.1) and found very favorable results. However since this is new parameter in 9.3, I wanted to reach-out to see if these are any known specific downside for setting the value to "variable".
    As I can't share the customer test cases here, I found a repeatable use-case in Vertica VMware version 10.0.0 to demonstrate the effectiveness of the parameter JoinDefaultTupleFormat.

    [dbadmin@localhost ~]$ vsql
    Password: 
    Welcome to vsql, the Vertica Analytic Database interactive terminal.
    
    Type:  \h or \? for help with vsql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
    dbadmin=> select version(), node_address, sysdate() from nodes;
                   version               | node_address |          sysdate          
    -------------------------------------+--------------+---------------------------
     Vertica Analytic Database v10.0.0-0 | 127.0.0.1    | 2020-12-28 11:13:38.19846
    (1 row)
    
    dbadmin=> select parameter_name, default_value from configuration_parameters
    dbadmin-> where parameter_name ilike 'JoinDefault%';
         parameter_name     | default_value 
    ------------------------+---------------
     JoinDefaultTupleFormat | fixed
    (1 row)
    
    dbadmin=> alter session set JoinDefaultTupleFormat = 'fixed';
    ALTER SESSION
    dbadmin=> 
    dbadmin=> show session parameter JoinDefaultTupleFormat;
              name          | setting 
    ------------------------+---------
     JoinDefaultTupleFormat | fixed
    (1 row)
    
    dbadmin=> 
    dbadmin=> \timing
    Timing is on.
    dbadmin=> \o /dev/null
    dbadmin=> profile select * from  v_catalog.log_queries;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273989883 and statement_id=1;
    NOTICE 3557:  Initiator memory for query: [on pool sysquery: 53491 KB, minimum: 53491 KB]
    NOTICE 5077:  Total memory required by query: [2241306 KB]
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273989883 and statement_id=2;
    NOTICE 3557:  Initiator memory for query: [on pool sysquery: 425180 KB, minimum: 425180 KB]
    NOTICE 5077:  Total memory required by query: [**2485156 KB**]
    ERROR 3815:  Join inner did not fit in memory [(v_catalog.vs_audit_categories x v_internal.dc_requests_completed) using  vs_audit_categories_p and previous join (PATH ID: 2)] [https://omegle.site/](https://omegle.site/)
    dbadmin=> 
    dbadmin=> 
    dbadmin=> **alter session set JoinDefaultTupleFormat = 'variable';**
    Time: First fetch (0 rows): 7.617 ms. All rows formatted: 7.650 ms
    dbadmin=> 
    dbadmin=> show session parameter JoinDefaultTupleFormat;
    Time: First fetch (1 row): 0.472 ms. All rows formatted: 0.490 ms
    dbadmin=> 
    dbadmin=> \o /dev/null
    dbadmin=> profile select * from  v_catalog.log_queries;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273989891 and statement_id=1;
    NOTICE 3557:  Initiator memory for query: [on pool sysquery: 54563 KB, minimum: 54563 KB]
    NOTICE 5077:  Total memory required by query: [**114479 KB**]
    Time: First fetch (88 rows): **2229.579** ms. All rows formatted: 2242.888 ms
    dbadmin=> 
    dbadmin=> 
    

    Thank you for this awesomely clear answer. Providing those example performance checks is also extremely helpful showing how to test before posting to the forum.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file