simple and crisp explanation for encoding and compression
Hi,
I couldnt find any clear explanation for encoding and compression in Vertica documentation.
just gives the encoding techniques but no correlation for compression except for Auto where they mentioned Auto encoding will lead to LZO for varchar types.
Could some one pls explain the encoding and compression in better.
0
Comments
Not sure what you mean by "correlation for compression".
I see it like this :
Vertica can apply encoding on top of data that means something so it depends on the data type of the data being encoded for it to chose the type of Encoding to apply.
Compression is preaty much "dark data" , Vertica will read it and would not be able to mapp it to any pattern that might apply encoding to, so will use AUTO that is preaty much an LZO.
Encoded data can be processed directly. It saves memory bandwidth and in some cases reduce IO througthput.
After encoding, the data in the block may also be compressed further. In other words, the data stored on the disk may have gone through two passes of encoding/compression.
When the block is read back from the disk, it will be decompressed into a buffer and processed encoded until the tuples need to be "materialized".
What is the procedure to pick which encoding for which column. ?
You can either do it using DBD.
You can lookup for projecitons that have encoding set to auto and try to fix them.
Or if you a very smart DB Geek , you can make your own choice in what encoding to use.
I normally do this !
I look for projeciton that have on encodings on them:
Run DESIGNER_DESIGN_PROJECTION_ENCODINGS to let Vertica pickup the most optimum encodings.
Example:
Messure the space ration before and after the encoding apply:
- just put in you tbl name and schema name.
When you identify your projecitons with issues run the following :
Note:
- the script will generate the sytnax for you to run.
- if you don`t DbVizualizer , you can just replace the $var with your schema + table name. Aslo see the true/false option , where true means deploy the new DDL of the projeciton false means don`t deploy , just save the script so you can analyze it before you run it.
I hope this helps !
Are you saying that we should try to skip AUTO encoding?
What are the other techniques preferred over AUTO.
Hi
Many times AUTO ENCODING is just fine .
Column disk allocation size comparison between different encodings , is good approach , however is not necessarily always the right approach (each encoding has its own advantage and disadvantage , eg:resource consumption , more CPU cycles and etc ) , at the end , you need to tune your query for each projection you plan to used and decide upon it .
You can see encoding as kind of compression that it based on pre knowledge of the data , RLE is a simple and good example , if the data is ordered is very simple to store one peer + sequence , regular zip do not have any idea about the underlying data , it just binary compression
By the way AUTO encoding will internally define RLE encoding ( which is one you want to use ) for strings and delta int for numbers with length < 19
I hope you will fine it useful
Thanks
No i am not sayng you should avoid AUTO encoding, like ELI said there are times when you cannot avoid AUTO encoding (due to the data content you have in that column).
What i posted there is a way of you identifing possible missing encoding columns in your projections.
Encoding helps in query performance, space usage, less work overall from your database.
Use DBD and internal dbd functions to reach the optimum encoding type.
I recommend going over this topics "in Vertica docs" so you can get familiar with how it works.
Hi
I think that now you have all the necessary tools /understanding to set the right projections encoding .
Thanks
All the columns in the Projection are AUTO encoded (unless we specifically mention any other encoding type) what ever the encoding be all columns are equally compressed (binary).
There is an option to avoid encoding (no encoding) is the only way to skip encoding.
on top of it if we execute DBD the encoding might change from AUTO to more appropriate encoding type...
Am I correct
Hi ,
Encoding is avoided only for specific cases , like when you have large numeric data type like numeric >19 .
In most of the cases you can't avoid encoding , when you not specified encoding the encoding will be AUTO .
Basically AUTO is not binary encoding , the idea in AUTO is that vertica choose the encoding based on your underline columns data type . Eg: for strings it will be RLE for numeric <19 it will be delete int .
On top of the encoding vertica also compress the column on binary level , eg: LZO for RLE encoding
Hope its more clear now
Thanks