Analyzing Table Statistics by Column

SarahLSarahL Administrator, Moderator, Employee, Registered User, VerticaExpert, CABuser

This blog post was authored by Jim Knicely.

The ANALYZE_STATISTICS function collects and aggregates data samples and storage information from all nodes that store projections associated with the specified table. On a very large wide table, it takes a significant amount of time to gather those statistics.

In many situations only a few columns in the table are modified as part of an ETL/ELT process. In these cases instead of reanalyzing all of the table’s columns, you can analyze just the updated columns, reducing the ANALYZE_STATISTICS function’s run time!

Example:

=> SELECT COUNT(*) FROM big_varchar_table;
   COUNT
------------
1000000000
(1 row)

=> \timing
Timing is on.

=> select analyze_statistics('big_varchar_table', 100); -- Analyze entire table
analyze_statistics
--------------------
                  0
(1 row)

Time: First fetch (1 row): 196707.123 ms. All rows formatted: 196707.185 ms

=> UPDATE big_varchar_table SET the_varchar = 'DATA UPDATED...' WHERE pk BETWEEN 2098174 AND 4098174;
OUTPUT
---------
2000001
(1 row)

Time: First fetch (1 row): 1909.409 ms. All rows formatted: 1909.474 ms

=> select analyze_statistics('big_varchar_table', 'the_varchar', 100); -- Analyze only updated column
analyze_statistics
--------------------
                  0
(1 row)

Time: First fetch (1 row): 48348.982 ms. All rows formatted: 48349.028 ms

By analyzing only the column that was updated I reduced the run time of the ANALYZE_STATISTICS function from 3.3 minutes to 48 seconds!

Sign In or Register to comment.