How does Vertica evaluate columns for encoding type?

I pulled the encoding types for the projection columns in my environment and was curious how Vertica evaluates columns/data in choosing an encoding type.
   data_type    |  encoding_type   | count
----------------+------------------+-------
 float          | AUTO             |  2782
 float          | RLE              |  1214
 float          | BLOCKDICT_COMP   |  1100
 float          | DELTARANGE_COMP  |  1026
 int            | DELTARANGE_COMP  |   804
 int            | RLE              |   696
 int            | AUTO             |   454
 date           | RLE              |   360
 int            | BLOCKDICT_COMP   |   328
 int            | COMMONDELTA_COMP |   292
 float          | COMMONDELTA_COMP |   286
 date           | BLOCKDICT_COMP   |   158
 timestamp      | GCDDELTA         |   148
 timestamp      | DELTARANGE_COMP  |   140
 timestamp      | AUTO             |   130
 timestamp      | RLE              |   114
 float          | BLOCK_DICT       |    90
 date           | COMMONDELTA_COMP |    80
 timestamp      | COMMONDELTA_COMP |    56
 date           | AUTO             |    36
 timestamp      | BLOCKDICT_COMP   |    34
 boolean        | RLE              |    28
 int            | DELTAVAL         |    22
 boolean        | AUTO             |    16
 date           | DELTAVAL         |    14
 int            | BLOCK_DICT       |    10
 int            | GCDDELTA         |     8
 numeric(8,3)   | DELTARANGE_COMP  |     8
 numeric(9,3)   | COMMONDELTA_COMP |     8
 numeric(10,3)  | BLOCKDICT_COMP   |     8
 date           | DELTARANGE_COMP  |     6
 numeric(19,5)  | BLOCKDICT_COMP   |     4
 numeric(37,15) | AUTO             |     4
 date           | BLOCK_DICT       |     2
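
(A query along these lines against the projection_columns system table produces this kind of summary; I'm paraphrasing, and the exact column names may differ by version:)

    SELECT data_type,
           encoding_type,
           COUNT(*)
    FROM   v_catalog.projection_columns
    GROUP  BY data_type, encoding_type
    ORDER  BY 3 DESC;
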
Since columns with AUTO encoding fall back to the default encoding outlined in the documentation, why aren't some of these columns following that default? For example, the documentation shows that float columns would use either COMMONDELTA_COMP or DELTARANGE_COMP. How did Vertica decide to use RLE or BLOCKDICT_COMP for float?

I would like to move the columns with AUTO encoding to a more appropriate encoding. Would a query-specific DBD run be the best way to do so?

Comments

  • Navin_C (Vertica Customer)
    Hi Norbert,

    I believe running the DBD with a good amount of sample data will give you much better projections with respect to encoding types.

    Besides, there is a good answer to a similar question here:

    https://community.vertica.com/vertica/topics/projections_with_encoding_possibilities

    Hope this helps.
  • While running the DBD to improve 5000 AUTO columns may be fun for some, I would rather understand the internal process that evaluates the data to choose the best encoding type, and automate that myself.

    The referenced question doesn't provide any information on the mechanism for determining encoding types.
  • Hi Norbert.

    The DBD *is* the automation of the internal process :-)  It is the (only) mechanism within Vertica for automatically determining encoding types.  What are you really trying to do?

    Are you thinking of the DBD as just the ncurses-based admintools UI?  In that case, your information is (slightly) out-of-date; that's just the GUI.  Please see our documentation; for example:

    https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunct...
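
    Roughly, a query-specific run through that SQL API looks like the sketch below. This is a minimal sketch rather than a recipe: the design name, table name, query text, and file paths are all placeholders, and you should check the exact function signatures against the docs for your version.

        -- Build a design scoped to one table, seed it with a
        -- representative query, and write out the proposed DDL.
        SELECT DESIGNER_CREATE_DESIGN('encoding_fix');
        SELECT DESIGNER_ADD_DESIGN_TABLES('encoding_fix', 'public.my_table');
        SELECT DESIGNER_ADD_DESIGN_QUERY('encoding_fix', 'SELECT ...', 1);
        SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('encoding_fix', 'BALANCED');
        -- Generate the design and emit design/deployment scripts.
        SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('encoding_fix',
               '/tmp/encoding_fix_design.sql', '/tmp/encoding_fix_deploy.sql');
        -- Clean up the design workspace when done.
        SELECT DESIGNER_DROP_DESIGN('encoding_fix');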

    The thing about column encodings that's tricky is that the correct encoding depends hugely on the arrangement of the data, which means both what your data looks like and how the particular projection has sorted it.

    The actual encoding-level work of the DBD is really quite straightforward (though perhaps not simple; also not fast) -- it's a bunch of introspection on your data followed by some math.  No canned heuristics, if that's what you're looking for; in practice those seem to get it wrong as often as they get it right, and with encodings, getting it wrong can be worse than having no encoding at all.  We instead do the whole thing of building a mathematical model of your data and solving for the optimal encoding.  That's not something that is easily broken apart; it works on the granularity of whole tables/projections, so that's the API that we provide.

    (It has been the case in the past that hand-created projections constructed by someone with lots of Vertica experience could sometimes achieve better compression than our math.  This was because our mathematical model was a bit oversimplified -- we assumed that each column should be considered in isolation.  It turns out that this is false in practice; the content in one column can affect how you should consider other columns on the same projection.  We hope that the new "Correlations Mode" in Vertica 7 will address these cases.  If you can still do better by hand, we'd be very interested to hear from you.)

    Adam
  • Adam,

    That was insightful. What I'm actually trying to do is avoid going to the DBD and have an automated process that picks up AUTO-encoded columns and determines a suitable encoding type for them.

    I think I'll wait to upgrade to 7 and use the DBD API.
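
    In the meantime, something along these lines (again assuming the projection_columns system table) should list which projections still carry AUTO columns, so I know what to feed into a designer run:

        -- Rank projections by how many AUTO-encoded columns they contain.
        SELECT table_schema,
               projection_name,
               COUNT(*) AS auto_columns
        FROM   v_catalog.projection_columns
        WHERE  encoding_type = 'AUTO'
        GROUP  BY table_schema, projection_name
        ORDER  BY auto_columns DESC;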
