Where the analyze_stats.sh script looks like this:
dbadmin=> \! cat /home/dbadmin/scripts/analyze_stats.sh
vsql -Atc "SELECT DISTINCT 'SELECT analyze_statistics(''' || p.projection_schema || '.' || p.projection_basename || ''');' run_these FROM projections p JOIN projection_storage ps USING (projection_id) WHERE NOT p.has_statistics AND NOT p.is_aggregate_projection AND ps.row_count > 1;" | vsql -a
@Jim_Knicely How frequently will the statistics become stale? I ran the select analyze_statistics() for once manually about 3 days ago, and i have more that 750M records loaded into db everyday but as of today i can still see 'statistics_type' column in 'projection_columns' table is still having value of 'Full'
Also read somewhere that running analyze_statistics() unnecessarily on the table increases catalog size, is that true?
The info 'FULL' for the statistics_type means just one of the possible values: 'FULL','ROWCOUNT','NONE'. That also means that 'FULL' statistics can very well be stale.
Imagine you have a TIMESTAMP column named creation_ts, DEFAULT CURRENT_TIMESTAMP. If you ran ANALYZE_STATISTICS() on that column yesterday and today you inserted more rows - the statistics on that column will be full, but still, every query asking for today's data , say, WHERE creation_ts >= TRUNC(CURRENT_TIMESTAMP,'DD') AND creation_ts < TRUNC(CURRENT_TIMESTAMP,'DD') + 1 will search for data for which no statistics have yet been gathered. For that type of query, the table/column's statistics would be too stale to allow for an efficient query plan, no matter what the statistics_type for that column may be.
Answers
You can have Vertica build ANALYZE_STATISTICS commands to analyze stats on projections that do not have stats... Then have crontab run them.
Simple Example:
So your crontab entry might run once a day like this:
00 00 * * * /home/dbadmin/scripts/analyze_stats.sh > /home/dbadmin/scripts/analyze_stats.out 2>&1
Where the analyze_stats.sh script looks like this:
@Jim_Knicely How frequently will the statistics become stale? I ran the select analyze_statistics() for once manually about 3 days ago, and i have more that 750M records loaded into db everyday but as of today i can still see 'statistics_type' column in 'projection_columns' table is still having value of 'Full'
Also read somewhere that running analyze_statistics() unnecessarily on the table increases catalog size, is that true?
The info 'FULL' for the
statistics_type
means just one of the possible values: 'FULL','ROWCOUNT','NONE'. That also means that 'FULL' statistics can very well be stale.Imagine you have a
TIMESTAMP
column namedcreation_ts
,DEFAULT CURRENT_TIMESTAMP
. If you ranANALYZE_STATISTICS()
on that column yesterday and today you inserted more rows - the statistics on that column will be full, but still, every query asking for today's data , say,WHERE creation_ts >= TRUNC(CURRENT_TIMESTAMP,'DD') AND creation_ts < TRUNC(CURRENT_TIMESTAMP,'DD') + 1
will search for data for which no statistics have yet been gathered. For that type of query, the table/column's statistics would be too stale to allow for an efficient query plan, no matter what thestatistics_type
for that column may be.