How does Vertica evaluate columns for encoding type?
I pulled the encoding types for the projection columns in my environment and was curious how Vertica evaluates columns/data in choosing an encoding type.
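A query along these lines against the standard v_catalog.projection_columns system table (which exposes both data_type and encoding_type) produces this kind of breakdown; this is a sketch, and the exact query may differ:

    -- Count projection columns by data type and encoding type.
    SELECT data_type, encoding_type, COUNT(*) AS count
    FROM v_catalog.projection_columns
    GROUP BY data_type, encoding_type
    ORDER BY count DESC;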
 data_type      | encoding_type    | count
----------------+------------------+-------
 float          | AUTO             |  2782
 float          | RLE              |  1214
 float          | BLOCKDICT_COMP   |  1100
 float          | DELTARANGE_COMP  |  1026
 int            | DELTARANGE_COMP  |   804
 int            | RLE              |   696
 int            | AUTO             |   454
 date           | RLE              |   360
 int            | BLOCKDICT_COMP   |   328
 int            | COMMONDELTA_COMP |   292
 float          | COMMONDELTA_COMP |   286
 date           | BLOCKDICT_COMP   |   158
 timestamp      | GCDDELTA         |   148
 timestamp      | DELTARANGE_COMP  |   140
 timestamp      | AUTO             |   130
 timestamp      | RLE              |   114
 float          | BLOCK_DICT       |    90
 date           | COMMONDELTA_COMP |    80
 timestamp      | COMMONDELTA_COMP |    56
 date           | AUTO             |    36
 timestamp      | BLOCKDICT_COMP   |    34
 boolean        | RLE              |    28
 int            | DELTAVAL         |    22
 boolean        | AUTO             |    16
 date           | DELTAVAL         |    14
 int            | BLOCK_DICT       |    10
 int            | GCDDELTA         |     8
 numeric(8,3)   | DELTARANGE_COMP  |     8
 numeric(9,3)   | COMMONDELTA_COMP |     8
 numeric(10,3)  | BLOCKDICT_COMP   |     8
 date           | DELTARANGE_COMP  |     6
 numeric(19,5)  | BLOCKDICT_COMP   |     4
 numeric(37,15) | AUTO             |     4
 date           | BLOCK_DICT       |     2

Since columns with an AUTO encoding fall back to the default encoding outlined in the documentation, why aren't some of these columns following the defaults? For example, the documentation shows that float columns would use either COMMONDELTA_COMP or DELTARANGE_COMP. How did Vertica decide to use RLE or BLOCKDICT_COMP for float columns?
I would like to improve the columns with AUTO to use a more appropriate encoding. Would the query-specific DBD run be the best way to do so?
Comments
I believe running the DBD with a good amount of sample data will give you much better projections with respect to encoding types.
Also, there is a good answer to a similar question here:
https://community.vertica.com/vertica/topics/projections_with_encoding_possibilities
Hope this helps.
The referenced question doesn't provide any information on the mechanism for determining encoding types.
The DBD *is* the automation of the internal process :-) It is the (only) mechanism within Vertica for automatically determining encoding types. What are you really trying to do?
Are you thinking of the DBD as just the ncurses-based admintools UI? If so, your information is (slightly) out-of-date; that's just one front end to it. Please see our documentation; for example:
https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunct...
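As a rough sketch, the programmatic path looks something like this in Vertica 7.x, using the DESIGNER_* function family; exact argument lists vary by version, and the design name, table name, query file, and output paths below are placeholders:

    -- Create an incremental design scoped to one table, driven by a
    -- workload file, then generate (and optionally deploy) projections.
    SELECT DESIGNER_CREATE_DESIGN('my_design');
    SELECT DESIGNER_SET_DESIGN_TYPE('my_design', 'INCREMENTAL');
    SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.my_table');
    SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/tmp/queries.sql');
    -- Writes the proposed CREATE PROJECTION statements, including the
    -- chosen encodings, to the design file.
    SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('my_design',
           '/tmp/my_design_projections.sql', '/tmp/my_design_deploy.sql');
    SELECT DESIGNER_DROP_DESIGN('my_design');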
The thing about column encodings that's tricky is that the correct encoding depends hugely on the arrangement of the data, which means both what your data looks like and how the particular projection has sorted it.
The actual encoding-level work of the DBD is really quite straightforward (though perhaps not simple; also not fast) -- it's a bunch of introspection on your data followed by some math. No canned heuristics, if that's what you're looking for; in practice those seem to get it wrong as often as they get it right, and with encodings, getting it wrong can be worse than having no encoding at all. We instead do the whole thing of building a mathematical model of your data and solving for the optimal encoding. That's not something that is easily broken apart; it works on the granularity of whole tables/projections, so that's the API that we provide.
(It has been the case in the past that hand-created projections constructed by someone with lots of Vertica experience could sometimes achieve better compression than our math. This was because our mathematical model was a bit oversimplified -- we assumed that each column should be considered in isolation. It turns out that this is false in practice; the content in one column can affect how you should consider other columns on the same projection. We hope that the new "Correlations Mode" in Vertica 7 will address these cases. If you can still do better by hand, we'd be very interested to hear from you.)
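If you do want to experiment by hand, encodings can be pinned explicitly in the projection definition. A minimal sketch follows; the table, columns, sort order, and encoding choices here are invented for illustration:

    -- Explicit per-column encodings in a hand-written projection.
    CREATE PROJECTION sales_by_date (
        sale_date ENCODING RLE,              -- low cardinality along the sort order
        region    ENCODING RLE,
        price     ENCODING DELTARANGE_COMP   -- floats in a narrow range
    ) AS
    SELECT sale_date, region, price
    FROM sales
    ORDER BY sale_date, region;
    SELECT START_REFRESH();  -- populate the new projection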
Adam
That was insightful. What I'm actually trying to do is avoid going to the DBD and have an automated process that picks up AUTO encoded columns and determines a suitable encoding type for them.
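For the enumeration half of that, a query like the following sketch (again against v_catalog.projection_columns) could pick out the AUTO-encoded columns; it's the choice of replacement encoding that needs the DBD's modeling:

    -- List projection columns still on the AUTO encoding, as candidates
    -- for a targeted DBD run.
    SELECT table_schema, table_name, projection_name,
           projection_column_name, data_type
    FROM v_catalog.projection_columns
    WHERE encoding_type = 'AUTO'
    ORDER BY table_schema, table_name, projection_name;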
I think I'll wait to upgrade to 7 and use the DBD API.