The Vertica Forum is getting a makeover! The site will be undergoing maintenance from Tuesday 8/13 to Friday 8/16. We appreciate your patience and cooperation during this time.

Analyzing Table Statistics by Column

This blog post was authored by Jim Knicely.

The ANALYZE_STATISTICS function collects and aggregates data samples and storage information from all nodes that store projections associated with the specified table. On a very large wide table, it takes a significant amount of time to gather those statistics.

In many situations only a few columns in the table are modified as part of an ETL/ELT process. In these cases instead of reanalyzing all of the table’s columns, you can analyze just the updated columns, reducing the ANALYZE_STATISTICS function’s run time!

Example:

=> SELECT COUNT(*) FROM big_varchar_table;
   COUNT
------------
1000000000
(1 row)

=> \timing
Timing is on.

=> select analyze_statistics('big_varchar_table', 100); -- Analyze entire table
analyze_statistics
--------------------
                  0
(1 row)

Time: First fetch (1 row): 196707.123 ms. All rows formatted: 196707.185 ms

=> UPDATE big_varchar_table SET the_varchar = 'DATA UPDATED...' WHERE pk BETWEEN 2098174 AND 4098174;
OUTPUT
---------
2000001
(1 row)

Time: First fetch (1 row): 1909.409 ms. All rows formatted: 1909.474 ms

=> select analyze_statistics('big_varchar_table', 'the_varchar', 100); -- Analyze only updated column
analyze_statistics
--------------------
                  0
(1 row)

Time: First fetch (1 row): 48348.982 ms. All rows formatted: 48349.028 ms

By analyzing only the column that was updated I reduced the run time of the ANALYZE_STATISTICS function from 3.3 minutes to 48 seconds!

Sign In or Register to comment.

Can't find what you're looking for? Search the Vertica Documentation, Knowledge Base, or Blog for more information.