Varchar columns in Vertica 7

We currently have several columns in a Vertica table that are of type VARCHAR. Some of them are defined as VARCHAR(255). However, the data load into this table (from Sqoop) fails if the number of characters in these columns exceeds 255. No data is loaded even if only one record out of a million exceeds this limit.

Can we increase the limit to 65000 for all of the varchar columns, i.e. make every varchar column VARCHAR(65000)? Is there a downside to this approach in terms of memory management or query performance? Will it slow down queries?

Please clarify. Thank you.

Ravi 

Comments

  • Firstly, when you are pushing data into the table, do you have "abort on error" specified? Second, you could try altering the column's data type by running ALTER TABLE ... ALTER COLUMN ... SET DATA TYPE varchar(65000) (see the sketch below). I believe that for varchar columns Vertica only uses as much space as the size of the value you are inserting. You can always test and check the performance on a test table.
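    A minimal sketch of that ALTER (the table and column names are placeholders, not from the original post):
    -- widen an existing varchar column in place; substitute your own table/column names
    ALTER TABLE myschema.table1
        ALTER COLUMN widget_description SET DATA TYPE varchar(65000);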
  • Giving larger values to your varchar data types might give you slower queries as the storage footprint increases, as shown in this example:
     I have two tables, test and test2, each with 12060 rows (repeated values):
    dbadmin=> \d test*
                                        List of Fields by Tables
     Schema | Table | Column |      Type      | Size  | Default | Not Null | Primary Key | Foreign Key
    --------+-------+--------+----------------+-------+---------+----------+-------------+-------------
     public | test  | name   | varchar(255)   |   255 |         | f        | f           |
     public | test2 | name2  | varchar(65000) | 65000 |         | f        | f           |
    (2 rows)
    When looking at the storage used by these two tables, I can see the footprint is different:
     dbadmin=> select projection_name, row_count, used_bytes, ros_used_bytes, wos_used_bytes
               from projection_storage where projection_name like 'test%';
    -[ RECORD 1 ]---+------------
    projection_name | test2_super
    row_count       | 12060
    used_bytes      | 1139
    ros_used_bytes  | 1139
    wos_used_bytes  | 0
    -[ RECORD 2 ]---+------------
    projection_name | test_super
    row_count       | 12060
    used_bytes      | 655
    ros_used_bytes  | 655
    wos_used_bytes  | 0
    So far I am OK with that!
    But when I query the tables, will the cost be different?
    As shown here, the cost is different, and the projection with the bigger data type seems to consume more:
    explain select * from test;
     Access Path:
     +-STORAGE ACCESS for test [Cost: 17, Rows: 12K (5 RLE)] (PATH ID: 2)
     |  Projection: public.test_DBD_1_rep_t_node0001
     |  Materialize: test.name

     
     explain select * from test2;
     Access Path:
     +-STORAGE ACCESS for test2 [Cost: 32, Rows: 12K (5 RLE)] (PATH ID: 2)
     |  Projection: public.test2_DBD_2_rep_t_node0001
     |  Materialize: test2.name2
    Now let's run a profile and see how many resources each query will use:
    dbadmin=> profile select * from test;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273900532 and statement_id=46;
    NOTICE 3557:  Initiator memory for query: [on pool general: 6151 KB, minimum: 6151 KB]
    Now the test2 table:
    dbadmin=> profile select * from test2;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273900532 and statement_id=47;
    NOTICE 3557:  Initiator memory for query: [on pool general: 37951 KB, minimum: 37951 KB]
    Conclusion: even when using a varchar data type, Vertica will still require more resources if its declared size is bigger than needed.
    This is my personal opinion; I drew these conclusions from my own tests.
  • My experience has been that varchar columns that are way bigger than they need to be will have performance implications. While it might sound reasonable to just make them all 65000, you might be better off making them only slightly bigger than they currently are - 500, or maybe 1,000. Or, force them to 255 with some sort of preprocessing script; seems like that would be the perfect job for a Perl script, perhaps (one load-time SQL alternative is sketched below).
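    If the data reaches Vertica through a plain COPY (rather than Sqoop), one load-time alternative to an external preprocessing script is to truncate the value with a FILLER column and an expression. This is only a sketch; the file path and column names are made up:
    COPY myschema.table1 (
        widget_description_raw FILLER varchar(65000),                  -- raw value from the file (hypothetical name)
        widget_description AS SUBSTR(widget_description_raw, 1, 255)   -- truncated to fit the 255-char column
    )
    FROM '/tmp/widgets.csv' DELIMITER ',' ABORT ON ERROR;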


  • The follow-up question to this (and the original motivation for making VARCHARs huge, like 65000, for every string column) is: how do we ALTER a table to lengthen varchar columns?

    For example, if I have table myschema.table1, with a column "widget_description" of type VARCHAR(255), how do I reliably lengthen the column should I need to?

    if I do:

    ALTER TABLE myschema.table1 ALTER COLUMN widget_description SET DATA TYPE varchar(512)

    Sometimes this works.

    Other times, I often get an error message to the effect of "column widget_description being used in projection myschema.table1_b1" or something like that. 

    To lengthen a column, I end up having to create a new table, then insert the entire contents of the original table into the new table, then drop the old table, and then rename the new table to the original table's name (roughly as sketched below).
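    A rough sketch of that swap-and-rename workflow, assuming a two-column table (the column list is hypothetical; carry over your real DDL, projections, and grants):
    CREATE TABLE myschema.table1_new (
        widget_id          int,
        widget_description varchar(512)   -- the widened column
    );
    INSERT /*+ DIRECT */ INTO myschema.table1_new SELECT * FROM myschema.table1;
    COMMIT;
    DROP TABLE myschema.table1;
    ALTER TABLE myschema.table1_new RENAME TO table1;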







  •   This is because that table is using column widget_description in the superprojection's segmentation clause.
    Check your object definition:
     select export_objects('','tbl_name');
    As an example, try adding a column and then altering that column's data type; no error will be shown. For instance:
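    A small sketch of that workaround (the column name widget_notes is purely illustrative): a column added after table creation is not referenced by the segmentation expression, so its type can be changed freely.
    ALTER TABLE myschema.table1 ADD COLUMN widget_notes varchar(255);
    ALTER TABLE myschema.table1
        ALTER COLUMN widget_notes SET DATA TYPE varchar(512);   -- succeeds; not in the segmentation clause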
  • What if all the columns are used in the superprojection's segmentation clause?

    When I do export_objects, I get "SEGMENTED BY hash(column1, column2, column3, etc..)"

    More importantly, how do I lengthen a VARCHAR column without creating an entirely new table and renaming it?


  •  It's not about the table so much as the projection definition here.
    To fix this, ask yourself whether you really need segmented projections. If your table is big and you do need it segmented, make sure you create it without that column in the segmentation clause.
    See the example:
    --create table with a segmentation clause on col1
    CREATE TABLE public.aaa(
        col1 varchar(255),
        col2 varchar(255)
    )
    ORDER BY col1, col2
    SEGMENTED BY hash(col1) ALL NODES KSAFE 1;
    --alter column 'col1', which IS IN the segmentation clause
    dbadmin=> ALTER TABLE aaa ALTER COLUMN col1 SET DATA TYPE varchar(512);
    ROLLBACK 2353:  Cannot alter type of column "col1" since it is referenced in the segmentation expression of projection "aaa_b1"
    --alter column 'col2', which IS NOT IN the segmentation clause
    dbadmin=> ALTER TABLE aaa ALTER COLUMN col2 SET DATA TYPE varchar(512);
    ALTER TABLE
    dbadmin=> \d aaa
                                        List of Fields by Tables
     Schema | Table | Column |     Type     | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+--------+--------------+------+---------+----------+-------------+-------------
     public | aaa   | col1   | varchar(255) |  255 |         | f        | f           |
     public | aaa   | col2   | varchar(512) |  512 |         | f        | f           |
    (2 rows)
    Now, if you insist on (or need) having your column in the segmentation clause, then you need to recreate the superprojection (table) with your desired data types and values; one possible sequence is sketched below.
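    A possible sequence for the example table above, untested here and assuming the default buddy projection names aaa_b0/aaa_b1 (verify the real names with export_objects() first):
    --create a replacement superprojection that does not segment on col1
    CREATE PROJECTION public.aaa_seg2 AS
        SELECT col1, col2 FROM public.aaa
        ORDER BY col1, col2
        SEGMENTED BY hash(col2) ALL NODES KSAFE 1;
    SELECT REFRESH('public.aaa');   --populate the new projection
    SELECT MAKE_AHM_NOW();          --advance the AHM so the old projections can be dropped
    DROP PROJECTION public.aaa_b0, public.aaa_b1;
    --col1 is no longer referenced in any segmentation expression, so the ALTER succeeds
    ALTER TABLE public.aaa ALTER COLUMN col1 SET DATA TYPE varchar(512);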



  • Normally I wouldn't expect to see a varchar column in a segmentation clause. That sounds kind of strange to me. Normally you would use a primary key or unique ID in the segmentation clause. Your goal is to ensure the data is evenly distributed across the nodes, and a single high-cardinality column will achieve that.
  • I was trying to point out to him why the error comes up.
    His settings are all default, so that is why he is getting stuck. And a varchar column will not be a good fit for the segmentation clause; data skew will suffer (the cluster will be as slow as its slowest node).
  • It looks like Vertica, by default, segments by the hash of all columns.

    If I understand correctly, the answer is to use a high-cardinality column, preferably an int, for segmentation, and to specify it in the CREATE TABLE statement, e.g.:
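    Something along these lines (a hypothetical table; widget_id stands in for whatever unique integer key you have):
    CREATE TABLE public.widgets (
        widget_id          int NOT NULL,
        widget_description varchar(255)
    )
    ORDER BY widget_id
    SEGMENTED BY hash(widget_id) ALL NODES KSAFE 1;
    --widget_description is not in the segmentation expression, so widening it later works
    ALTER TABLE public.widgets ALTER COLUMN widget_description SET DATA TYPE varchar(512);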


  • Yes.
    Take a look at this article to see how to control data skew better.
  • It's all well and good to try to avoid data skew, but optimizing your projections for run-time performance of your basic joins and group-bys is really more important than avoiding [relatively minor] skew.  The cost of dynamically resegmenting data for a join or a group by, because the projection isn't segmented on the join key(s) or grouping columns, will usually be greater than the extra time caused by some skew.  If the data is skewed in a MAJOR way, you might want to think about avoiding the skew. A quick sketch of the segmentation point follows.
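    An illustration of that trade-off (hypothetical tables): if two projections are segmented identically on the join key, Vertica can join them locally on each node instead of resegmenting one side at run time.
    CREATE TABLE public.sales (
        sale_id     int,
        customer_id int,
        amount      numeric(12,2)
    )
    SEGMENTED BY hash(customer_id) ALL NODES KSAFE 1;
    CREATE TABLE public.customers (
        customer_id int,
        name        varchar(255)
    )
    SEGMENTED BY hash(customer_id) ALL NODES KSAFE 1;
    --a join such as the one below should not need a RESEGMENT step in its plan:
    --SELECT c.name, SUM(s.amount) FROM sales s JOIN customers c USING (customer_id) GROUP BY c.name;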

      --Sharon
