Checking and improving column compression and encoding
When working with terabytes of data, storage and transfer become major time and cost sinks. Vertica can help minimize storage cost and transfer time with column compression and encoding.
How can we identify Vertica tables that might benefit from better compression? Information about column size and current encoding is spread across the COLUMN_STORAGE and PROJECTION_COLUMNS system tables. The following query locates tables that have only the default (AUTO) encoding, which can be sub-optimal, or even no compression at all. (The WHERE clause excludes flex tables[1] and internal Database Designer (dbd) schemas that can't be edited.)
SELECT anchor_table_schema,
       anchor_table_name,
       MAX(encoding_type),
       SUM(ros_used_bytes)
FROM column_storage cs
     INNER JOIN projection_columns pc USING (column_id)
WHERE column_name NOT IN ('__raw__', '__identity__')
  AND anchor_table_schema NOT LIKE '%dbd%'
GROUP BY anchor_table_schema, anchor_table_name
HAVING MAX(encoding_type) = 'AUTO'
ORDER BY anchor_table_schema, anchor_table_name;
Let's work with an example:
anchor_table_schema | anchor_table_name | MAX | SUM
---------------------+---------------------------+------+------------
public | big_fact_table | AUTO | 6841357747
This table has no projections with compression. We can have Vertica determine optimized compression using the meta-function DESIGNER_DESIGN_PROJECTION_ENCODINGS, which will output a projection definition with compression:
SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS('public.big_fact_table','');
This outputs a script that creates a new projection and drops the existing one (likely a superprojection). Before deploying, though, let's see how much space we'd save. I'll run the script only up to the MAKE_AHM_NOW() call, so the new projection contains data but the old one still exists, and then compare the projection sizes:
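For reference, here's a sketch of what the generated deployment script typically looks like. The column names, encoding choices, and segmentation clause below are illustrative assumptions; the actual script depends entirely on your table and Vertica version:

```sql
-- Hypothetical sketch of DESIGNER_DESIGN_PROJECTION_ENCODINGS output;
-- real column list, encodings, and segmentation will differ.
CREATE PROJECTION public.big_fact_table_DBD_1_seg_EncodingDesign
(
  id   ENCODING DELTAVAL,
  ts   ENCODING COMMONDELTA_COMP,
  note ENCODING RLE
)
AS SELECT id, ts, note FROM public.big_fact_table
SEGMENTED BY HASH(id) ALL NODES;

SELECT REFRESH('public.big_fact_table');  -- populate the new projection
SELECT MAKE_AHM_NOW();                    -- stop here to compare sizes first
-- The script then drops the old superprojection:
-- DROP PROJECTION public.big_fact_table_super;
```

Stopping before the DROP is what lets us hold both projections side by side for the size comparison below.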
=> select anchor_table_name, projection_name, used_bytes from projection_storage where anchor_table_name = 'big_fact_table' limit 5;
anchor_table_name | projection_name | used_bytes
-------------------+-----------------------------------------+------------
big_fact_table | big_fact_table_DBD_1_seg_EncodingDesign | 2331490442
big_fact_table | big_fact_table_super | 6843970195
That's roughly a 66% size reduction (your mileage may vary, of course). But at terabyte scale, even a smaller savings will still help with storage usage and processing speed, because you're now moving that much less data over the network and I/O channels too.
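If you want that percentage straight from the database, a quick query against PROJECTION_STORAGE can compute it (the projection names here are the ones from this example):

```sql
-- Compare the new design's footprint against the old superprojection.
SELECT 100 * (1 - new.used_bytes / old.used_bytes::FLOAT) AS pct_saved
FROM (SELECT SUM(used_bytes) AS used_bytes FROM projection_storage
      WHERE projection_name = 'big_fact_table_DBD_1_seg_EncodingDesign') new,
     (SELECT SUM(used_bytes) AS used_bytes FROM projection_storage
      WHERE projection_name = 'big_fact_table_super') old;
-- With the sizes above: 100 * (1 - 2331490442/6843970195) ≈ 65.9
```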
Have fun and get saving! Contact us on the forum if you'd like to talk about projection design and compression.
For more information:
COLUMN_STORAGE system table: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/MONITOR/COLUMN_STORAGE.htm
PROJECTION_COLUMNS system table: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/CATALOG/PROJECTION_COLUMNS.htm
PROJECTION_STORAGE system table: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/MONITOR/PROJECTION_STORAGE.htm
Database Designer DESIGNER_DESIGN_PROJECTION_ENCODINGS meta-function: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/DatabaseDesigner/DESIGNER_DESIGN_PROJECTION_ENCODINGS.htm
[1] It is possible to run DESIGNER_DESIGN_PROJECTION_ENCODINGS on materialized columns in flex tables, but you need to run it on the materialized projection, not the flex table itself. Here's an abridged example:
dbadmin=> CREATE FLEX TABLE foo();
dbadmin=> INSERT /*+ DIRECT */ INTO foo SELECT c FROM some_big_table;
dbadmin=> SELECT compute_flextable_keys('foo');
dbadmin=> SELECT materialize_flextable_columns('foo', 1);
dbadmin=> SELECT projection_name, encoding_type FROM projection_columns WHERE projection_column_name = 'c' AND projection_name ILIKE 'foo%';
projection_name | encoding_type
-----------------+---------------
foo_b1 | AUTO
foo_b0 | AUTO
(2 rows)
dbadmin=> SELECT DISTINCT projection_basename FROM projections WHERE anchor_table_name = 'foo';
 projection_basename
---------------------
 foo
(1 row)
dbadmin=> SELECT designer_design_projection_encodings('foo', '/home/dbadmin/foo.sql', TRUE, TRUE);
dbadmin=> SELECT projection_name, encoding_type FROM projection_columns WHERE projection_column_name = 'c' AND projection_name ILIKE 'foo%';
projection_name | encoding_type
-----------------+------------------
foo_b1 | COMMONDELTA_COMP
foo_b0 | COMMONDELTA_COMP
(2 rows)
Comments
I'd suggest some caution around the _COMP encoding types, which Database Designer loves to recommend, when using them with very large tables. Database Designer prioritizes compression and minimizing I/O, but, as documented, the _COMP encoding types are CPU-intensive. I've seen significant query performance degradation with them, to the extent that I now make it common practice to remove them after generating the design.
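One way to do that (a sketch only; the projection definition, column names, and original encodings below are assumptions you'd replace with your own generated design) is to edit the script before deploying it, swapping the CPU-heavy encodings for cheaper ones:

```sql
-- Hypothetical edit of a generated projection definition: swap the
-- CPU-heavy _COMP encodings for cheaper alternatives before deploying.
CREATE PROJECTION public.big_fact_table_DBD_1_seg_EncodingDesign
(
  id   ENCODING DELTAVAL,
  ts   ENCODING DELTAVAL,  -- was COMMONDELTA_COMP in the generated design
  note ENCODING AUTO       -- was BLOCKDICT_COMP in the generated design
)
AS SELECT id, ts, note FROM public.big_fact_table
SEGMENTED BY HASH(id) ALL NODES;
```

You trade back some disk space for query-time CPU; re-running the size comparison above after the edit shows what that trade costs you.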