simple and crisp explanation for encoding and compression

SRA1 · February 2016

Hi,

I couldnt find any clear explanation for encoding and compression in Vertica documentation.

The link https://my.vertica.com/docs/7.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/encoding-type.htm

just gives the encoding techniques but no correlation for compression except for Auto where they mentioned Auto encoding will lead to LZO for varchar types.

Could some one pls explain the encoding and compression in better.

Adrian_Oprea_1 · February 2016

Not sure what you mean by "correlation for compression".

I see it like this :

Vertica can apply encoding on top of data that means something so it depends on the data type of the data being encoded for it to chose the type of Encoding to apply.

Compression is preaty much "dark data" , Vertica will read it and would not be able to mapp it to any pattern that might apply encoding to, so will use AUTO that is preaty much an LZO.

Encoded data can be processed directly. It saves memory bandwidth and in some cases reduce IO througthput.

After encoding, the data in the block may also be compressed further. In other words, the data stored on the disk may have gone through two passes of encoding/compression.

When the block is read back from the disk, it will be decompressed into a buffer and processed encoded until the tuples need to be "materialized".

SRA1 · February 2016

What is the procedure to pick which encoding for which column. ?

Adrian_Oprea_1 · February 2016

You can either do it using DBD.

You can lookup for projecitons that have encoding set to auto and try to fix them.

Or if you a very smart DB Geek , you can make your own choice in what encoding to use.

I normally do this !

I look for projeciton that have on encodings on them:

select * from projection_columns where encoding_type = 'AUTO'  order by table_name

Run DESIGNER_DESIGN_PROJECTION_ENCODINGS to let Vertica pickup the most optimum encodings.

Example:

Messure the space ration before and after the encoding apply:

- just put in you tbl name and schema name.

SELECT
    tab.anchor_table_schema,
    tab.anchor_table_name,
    tab.GB_Table,
    proj.projection_name,
    proj.GB_Proj,
    col.column_name,
    col.MB_Col,
    col.row_count,
    col.encodings,
    col.compressions,
    colOrder.column_position,
    colOrder.sort_position
FROM
    (((
    (
        SELECT
            projection_storage.anchor_table_schema,
            projection_storage.anchor_table_name,
            ((SUM(projection_storage.used_bytes) / 1000000000::FLOAT))::INT AS GB_Table
        FROM
            v_monitor.projection_storage
            where projection_storage.anchor_table_name='table name'
            and projection_storage.anchor_table_schema='schema name'
        GROUP BY
            projection_storage.anchor_table_schema,
            projection_storage.anchor_table_name) tab
LEFT JOIN
    (
        SELECT
            projection_storage.anchor_table_schema,
            projection_storage.anchor_table_name,
            projection_storage.projection_name,
            ((SUM(projection_storage.used_bytes) / 1000000000::FLOAT))::INT AS GB_Proj
        FROM
            v_monitor.projection_storage
        GROUP BY
            projection_storage.anchor_table_schema,
            projection_storage.anchor_table_name,
            projection_storage.projection_name) proj
ON
    (((
                proj.anchor_table_schema = tab.anchor_table_schema)
        AND (
                proj.anchor_table_name = tab.anchor_table_name))))
LEFT JOIN
    (
        SELECT
            column_storage.anchor_table_schema,
            column_storage.anchor_table_name,
            column_storage.projection_name,
            column_storage.column_name,
            column_storage.encodings,
            column_storage.compressions,
            ((SUM(column_storage.used_bytes) / 1000000::FLOAT))::INT AS MB_Col,
            SUM(column_storage.row_count)                            AS row_count
        FROM
            v_monitor.column_storage
        GROUP BY
            column_storage.anchor_table_schema,
            column_storage.anchor_table_name,
            column_storage.projection_name,
            column_storage.column_name,
            column_storage.encodings,
            column_storage.compressions) col
ON
    (((
                col.anchor_table_schema = tab.anchor_table_schema)
        AND (
                col.anchor_table_name = tab.anchor_table_name)
        AND (
                col.projection_name = proj.projection_name))))
LEFT JOIN
    v_catalog.projection_columns colOrder
ON
    (((
                colOrder.table_schema = tab.anchor_table_schema)
        AND (
                colOrder.table_name = tab.anchor_table_name)
        AND (
                colOrder.projection_name = proj.projection_name)
        AND (
                colOrder.table_column_name = col.column_name))))
ORDER BY
    tab.GB_Table DESC,
    tab.anchor_table_name,
    proj.GB_Proj DESC,
    proj.projection_name,
    col.MB_Col DESC;

When you identify your projecitons with issues run the following :

SELECT distinct 
    'select DESIGNER_DESIGN_PROJECTION_ENCODINGS ('''|| projection_schema || '.'|| projection_basename||
    ''',''/tmp/'|| projection_schema|| '.'|| projection_basename|| '.sql'','|| '''$depoly true/false$'');'
FROM
    projections
WHERE
 anchor_table_name='$table name$'
 and projection_schema = '$schema name$';

Note:

- the script will generate the sytnax for you to run.

- if you don`t DbVizualizer , you can just replace the $var with your schema + table name. Aslo see the true/false option , where true means deploy the new DDL of the projeciton false means don`t deploy , just save the script so you can analyze it before you run it.

I hope this helps !

SRA1 · February 2016

Are you saying that we should try to skip AUTO encoding?

What are the other techniques preferred over AUTO.

eli_revach · February 2016

Hi

Many times AUTO ENCODING is just fine .

Column disk allocation size comparison between different encodings , is good approach , however is not necessarily always the right approach (each encoding has its own advantage and disadvantage , eg:resource consumption , more CPU cycles and etc ) , at the end , you need to tune your query for each projection you plan to used and decide upon it .

You can see encoding as kind of compression that it based on pre knowledge of the data , RLE is a simple and good example , if the data is ordered is very simple to store one peer + sequence , regular zip do not have any idea about the underlying data , it just binary compression

By the way AUTO encoding will internally define RLE encoding ( which is one you want to use ) for strings and delta int for numbers with length < 19

I hope you will fine it useful

Thanks

Adrian_Oprea_1 · February 2016

No i am not sayng you should avoid AUTO encoding, like ELI said there are times when you cannot avoid AUTO encoding (due to the data content you have in that column).

What i posted there is a way of you identifing possible missing encoding columns in your projections.

Encoding helps in query performance, space usage, less work overall from your database.

Use DBD and internal dbd functions to reach the optimum encoding type.

I recommend going over this topics "in Vertica docs" so you can get familiar with how it works.

eli_revach · March 2016

Hi

I think that now you have all the necessary tools /understanding to set the right projections encoding .

Thanks

SRA1 · March 2016

All the columns in the Projection are AUTO encoded (unless we specifically mention any other encoding type) what ever the encoding be all columns are equally compressed (binary).

There is an option to avoid encoding (no encoding) is the only way to skip encoding.

on top of it if we execute DBD the encoding might change from AUTO to more appropriate encoding type...

Am I correct

eli_revach · March 2016

Hi ,
Encoding is avoided only for specific cases , like when you have large numeric data type like numeric >19 .

In most of the cases you can't avoid encoding , when you not specified encoding the encoding will be AUTO .

Basically AUTO is not binary encoding , the idea in AUTO is that vertica choose the encoding based on your underline columns data type . Eg: for strings it will be RLE for numeric <19 it will be delete int .

On top of the encoding vertica also compress the column on binary level , eg: LZO for RLE encoding

Hope its more clear now

Thanks

We're Moving!

Create My New Community Account Now

simple and crisp explanation for encoding and compression

Comments

Leave a Comment