Fine-grained control over compressions used for some encoding types
I'm new to Vertica and attempting to understand how it handles encoding and compression of column data. Specifically, I'm interested in learning about whether or not it's possible to exactly control the encoding type and compression type for a column. For example, suppose that I created the following table and inserted some data:
CREATE TABLE test_table (company VARCHAR(90), order_count INT); CREATE PROJECTION test_table_super (company, order_count ENCODING RLE) AS SELECT company, order_count FROM test_table; INSERT INTO test_table VALUES ('BigCompany', 823);
If I then take a look at the compressions and encodings fields, I find:
select column_name, encodings, compressions from v_monitor.column_storage where projection_name = 'test_table_super'; column_name | encodings | compressions -------------+-----------+-------------- company | String | lzo order_count | RLE | lzo epoch | Int_Delta | none
The order_count field has RLE as its encoding and lzo as its compression. Would it be possible to force that column to use the bzip2 compression algorithm instead? Or maybe gzip even? Or is that kind of precise control just not exposed to the user?
Thanks!!
Tagged:
0
Answers
Something like?
Thanks so much for this suggestion. I thought that there would be something to allow changing the type of compression used.
This system parameter seems to apply to the entire database, right? Or does it apply to all databases in the cluster?
Out of curiosity, is it possible to control the compression algorithm used on a more fine-grained level? Such as having some columns in a table compressed with LZO and some with GZIP?
It's rare to see more than one (production) database on one cluster. But yeah the setting applies to the database. Maybe the other experts can chime in if there's a more fine-grained way. But, I'm more interested in what you're trying to achieve. There are many benchmarks out there that compare LZO vs GZIP vs BZIP2, and LZO is the better of the three that provides very fast decompression speeds (which in turn translates to more query performance) and that's probably why it's set as the default.
Regarding what I'm trying to accomplish, I am mainly just trying to understand what things are possible with Vertica. If it were possible to select different compressions for different fields, then it might be possible to do things like using ZSTD_FAST_COMP for some fields while keeping the default of LZO for other fields. I admit that I'm uncertain about whether or not this would actually be valuable.
I think that part of the reason why the usefulness of this capability may be limited is due to the fact that there are some "encoding types" that have both an encoding and a compression (like RLE), and a lot that don't. If you have a VARCHAR field, then the encoding is just "String" and then there is also a compression that you can specify, so there wouldn't be much use for any more fine-grained control there. If you have an INT field that has the DELTAVAL encoding type, then the encoding is listed as "Int_Delta" and the compression is "none". No idea if any form of compression is even a possibility there. Anyway, trying to understand this stuff is just part of the design process for enormous data volumes, where any tiny optimization is pursued aggressively.