Simple and crisp explanation of encoding and compression

Hi,

I couldn't find any clear explanation of encoding and compression in the Vertica documentation.

The link https://my.vertica.com/docs/7.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/encoding-type.htm

 

just gives the encoding techniques but no correlation with compression, except for AUTO, where they mention that AUTO encoding will lead to LZO for VARCHAR types.

 

Could someone please explain encoding and compression in more detail?

Comments

  •  Not sure what you mean by "correlation for compression".

     

    I see it like this:

      Vertica can apply encoding on top of data that means something to it, so the data type of the data being encoded determines which type of encoding it chooses to apply.

      Compression is pretty much "dark data": Vertica will read it but will not be able to map it to any pattern that an encoding could exploit, so it will use AUTO, which is pretty much LZO.

     

     

     Encoded data can be processed directly. That saves memory bandwidth and in some cases reduces IO.

      After encoding, the data in the block may also be compressed further. In other words, the data stored on disk may have gone through two passes of encoding/compression.

      When the block is read back from disk, it is decompressed into a buffer and processed in its encoded form until the tuples need to be "materialized".
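     As a minimal sketch of how an encoding is declared explicitly per column (the table, projection, and column names below are made up for illustration):

    -- Hypothetical example: RLE is declared on a low-cardinality column that
    -- leads the sort order; the other columns fall back to AUTO.
    CREATE TABLE web_clicks (
        click_date DATE,
        user_id    INT,
        url        VARCHAR(500)
    );

    CREATE PROJECTION web_clicks_p (
        click_date ENCODING RLE,
        user_id,
        url
    )
    AS SELECT click_date, user_id, url
       FROM web_clicks
       ORDER BY click_date, user_id
    UNSEGMENTED ALL NODES;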

     

     

  • What is the procedure for picking which encoding to use for which column?

     

  •  

     You can do it using DBD (Database Designer).

     You can look for projections that have encoding set to AUTO and try to fix them.

     Or, if you are a very smart DB geek :), you can make your own choice of which encoding to use.

     

    I normally do this!

    I look for projection columns that have no explicit encoding on them:

    select * from projection_columns where encoding_type = 'AUTO' order by table_name;

    Run DESIGNER_DESIGN_PROJECTION_ENCODINGS to let Vertica pick the optimal encodings.
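    For example (a sketch only; the schema and projection names are placeholders, and I am assuming the same three-argument form - projection, output file, deploy flag - that the generator further down produces):

    -- Hypothetical names: write the proposed encodings to a /tmp script for
    -- review without deploying them (deploy flag = false).
    SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS(
        'my_schema.my_projection_b0',
        '/tmp/my_schema.my_projection_b0.sql',
        'false');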

     

     

     

    Example:

    Measure the space ratio before and after the encodings are applied:

    - just put in your table name and schema name.

    SELECT
        tab.anchor_table_schema,
        tab.anchor_table_name,
        tab.GB_Table,
        proj.projection_name,
        proj.GB_Proj,
        col.column_name,
        col.MB_Col,
        col.row_count,
        col.encodings,
        col.compressions,
        colOrder.column_position,
        colOrder.sort_position
    FROM
        (
            SELECT
                anchor_table_schema,
                anchor_table_name,
                (SUM(used_bytes) / 1000000000::FLOAT)::INT AS GB_Table
            FROM v_monitor.projection_storage
            WHERE anchor_table_name = 'table name'
              AND anchor_table_schema = 'schema name'
            GROUP BY
                anchor_table_schema,
                anchor_table_name
        ) tab
        LEFT JOIN
        (
            SELECT
                anchor_table_schema,
                anchor_table_name,
                projection_name,
                (SUM(used_bytes) / 1000000000::FLOAT)::INT AS GB_Proj
            FROM v_monitor.projection_storage
            GROUP BY
                anchor_table_schema,
                anchor_table_name,
                projection_name
        ) proj
            ON proj.anchor_table_schema = tab.anchor_table_schema
           AND proj.anchor_table_name = tab.anchor_table_name
        LEFT JOIN
        (
            SELECT
                anchor_table_schema,
                anchor_table_name,
                projection_name,
                column_name,
                encodings,
                compressions,
                (SUM(used_bytes) / 1000000::FLOAT)::INT AS MB_Col,
                SUM(row_count) AS row_count
            FROM v_monitor.column_storage
            GROUP BY
                anchor_table_schema,
                anchor_table_name,
                projection_name,
                column_name,
                encodings,
                compressions
        ) col
            ON col.anchor_table_schema = tab.anchor_table_schema
           AND col.anchor_table_name = tab.anchor_table_name
           AND col.projection_name = proj.projection_name
        LEFT JOIN v_catalog.projection_columns colOrder
            ON colOrder.table_schema = tab.anchor_table_schema
           AND colOrder.table_name = tab.anchor_table_name
           AND colOrder.projection_name = proj.projection_name
           AND colOrder.table_column_name = col.column_name
    ORDER BY
        tab.GB_Table DESC,
        tab.anchor_table_name,
        proj.GB_Proj DESC,
        proj.projection_name,
        col.MB_Col DESC;

      When you identify the projections with issues, run the following:

    SELECT DISTINCT
        'select DESIGNER_DESIGN_PROJECTION_ENCODINGS (''' || projection_schema || '.' || projection_basename ||
        ''',''/tmp/' || projection_schema || '.' || projection_basename || '.sql'',' || '''$deploy true/false$'');'
    FROM projections
    WHERE anchor_table_name = '$table name$'
      AND projection_schema = '$schema name$';

    Note:

    - The script will generate the syntax for you to run.

    - If you don't use DbVisualizer, just replace each $...$ placeholder with your schema and table name. Also note the true/false option: true means deploy the new projection DDL, false means don't deploy and just save the script so you can analyze it before you run it.

     

    I hope this helps!

     

     

  • Are you saying that we should try to skip AUTO encoding?

    What other techniques are preferred over AUTO?

  • Hi

    Many times AUTO encoding is just fine.

    Comparing a column's disk allocation size under different encodings is a good approach, but it is not necessarily always the right one (each encoding has its own advantages and disadvantages, e.g. resource consumption, more CPU cycles, etc.). In the end, you need to tune your queries against each projection you plan to use and decide based on that.

     

    You can see encoding as a kind of compression that is based on prior knowledge of the data. RLE is a simple and good example: if the data is ordered, it is very simple to store one value plus its run length (a quick way to sanity-check RLE candidates is sketched at the end of this comment). A regular zip has no idea about the underlying data; it is just binary compression.

     

    By the way, AUTO encoding will internally pick RLE encoding (which is the one you want to use) for strings, and delta int for numbers with length < 19.

     

    I hope you find it useful.

    Thanks
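    P.S. As a rough sanity check of the "one value plus a run length" idea, you can compare a column's distinct count against the row count before choosing RLE (table and column names here are made up):

    -- Hypothetical names: a low ratio of distinct values to rows, with the
    -- column leading the projection's sort order, is what lets RLE store
    -- each value once together with its repetition count.
    SELECT COUNT(DISTINCT click_date) AS distinct_values,
           COUNT(*)                   AS total_rows
    FROM web_clicks;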

  •   No, I am not saying you should avoid AUTO encoding; like Eli said, there are times when you cannot avoid AUTO encoding (due to the data content you have in that column).

     What I posted there is a way for you to identify columns in your projections that may be missing a better encoding.

     Encoding helps with query performance and space usage, and means less work overall for your database.

     Use DBD and the internal DBD functions to reach the optimum encoding type.

     

    I recommend going over these topics in the Vertica docs so you can get familiar with how it works.

     

     

  • Hi


    I think that now you have all the necessary tools and understanding to set the right projection encodings.

     

    Thanks 

  • All the columns in a projection are AUTO encoded (unless we specifically mention another encoding type), and whatever the encoding is, all columns are equally compressed (binary).

    There is an option to avoid encoding (no encoding), which is the only way to skip encoding.

    On top of that, if we run DBD, the encoding might change from AUTO to a more appropriate encoding type...

    Am I correct?

  • Hi,
    Encoding is avoided only in specific cases, like when you have a large numeric data type (numeric > 19).

     

    In most cases you can't avoid encoding; when you don't specify an encoding, it will be AUTO.

     

    Basically, AUTO is not binary encoding. The idea behind AUTO is that Vertica chooses the encoding based on your underlying column's data type, e.g. for strings it will be RLE and for numeric < 19 it will be delta int.

     

    On top of the encoding, Vertica also compresses the column at the binary level, e.g. LZO on top of RLE encoding (the queries at the end of this comment show how to check what a column actually got).

     

    Hope it's clearer now.

     

    Thanks 
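    One way to see this for yourself (a sketch, with 'schema name' and 'table name' as placeholders) is to compare the declared encoding in the catalog with what is actually stored on disk:

    -- Declared encoding type per projection column (catalog view).
    SELECT projection_name, table_column_name, encoding_type
    FROM v_catalog.projection_columns
    WHERE table_schema = 'schema name'
      AND table_name = 'table name';

    -- Encodings and compressions actually used in storage (monitor view).
    SELECT projection_name, column_name, encodings, compressions
    FROM v_monitor.column_storage
    WHERE anchor_table_schema = 'schema name'
      AND anchor_table_name = 'table name';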
