Where the analyze_stats.sh script looks like this:
dbadmin=> \! cat /home/dbadmin/scripts/analyze_stats.sh
vsql -Atc "SELECT DISTINCT 'SELECT analyze_statistics(''' || p.projection_schema || '.' || p.projection_basename || ''');' run_these FROM projections p JOIN projection_storage ps USING (projection_id) WHERE NOT p.has_statistics AND NOT p.is_aggregate_projection AND ps.row_count > 1;" | vsql -a
@Jim_Knicely How frequently will the statistics become stale? I ran the select analyze_statistics() for once manually about 3 days ago, and i have more that 750M records loaded into db everyday but as of today i can still see 'statistics_type' column in 'projection_columns' table is still having value of 'Full'
Also read somewhere that running analyze_statistics() unnecessarily on the table increases catalog size, is that true?
The info 'FULL' for the statistics_type means just one of the possible values: 'FULL','ROWCOUNT','NONE'. That also means that 'FULL' statistics can very well be stale.
Imagine you have a TIMESTAMP column named creation_ts, DEFAULT CURRENT_TIMESTAMP. If you ran ANALYZE_STATISTICS() on that column yesterday and today you inserted more rows - the statistics on that column will be full, but still, every query asking for today's data , say, WHERE creation_ts >= TRUNC(CURRENT_TIMESTAMP,'DD') AND creation_ts < TRUNC(CURRENT_TIMESTAMP,'DD') + 1 will search for data for which no statistics have yet been gathered. For that type of query, the table/column's statistics would be too stale to allow for an efficient query plan, no matter what the statistics_type for that column may be.
Answers
You can have Vertica build ANALYZE_STATISTICS commands to analyze stats on projections that do not have stats... Then have crontab run them.
Simple Example:
dbadmin=> SELECT DISTINCT 'SELECT analyze_statistics(''' || p.projection_schema || '.' || p.projection_basename || ''');' run_these FROM projections p JOIN projection_storage ps USING (projection_id) WHERE NOT p.has_statistics AND NOT p.is_aggregate_projection AND ps.row_count > 0; run_these ------------------------------------------------ SELECT analyze_statistics('public.some_fact'); SELECT analyze_statistics('public.no_stats'); (2 rows) dbadmin=> \! vsql -Atc "SELECT DISTINCT 'SELECT analyze_statistics(''' || p.projection_schema || '.' || p.projection_basename || ''');' run_these FROM projections p JOIN projection_storage ps USING (projection_id) WHERE NOT p.has_statistics AND NOT p.is_aggregate_projection AND ps.row_count > 0;" | vsql -a SELECT analyze_statistics('public.some_fact'); analyze_statistics -------------------- 0 (1 row) SELECT analyze_statistics('public.no_stats'); analyze_statistics -------------------- 0 (1 row) dbadmin=> SELECT DISTINCT 'SELECT analyze_statistics(''' || p.projection_schema || '.' || p.projection_basename || ''');' run_these FROM projections p JOIN projection_storage ps USING (projection_id) WHERE NOT p.has_statistics AND NOT p.is_aggregate_projection AND ps.row_count > 0; run_these ----------- (0 rows)So your crontab entry might run once a day like this:
00 00 * * * /home/dbadmin/scripts/analyze_stats.sh > /home/dbadmin/scripts/analyze_stats.out 2>&1Where the analyze_stats.sh script looks like this:
dbadmin=> \! cat /home/dbadmin/scripts/analyze_stats.sh vsql -Atc "SELECT DISTINCT 'SELECT analyze_statistics(''' || p.projection_schema || '.' || p.projection_basename || ''');' run_these FROM projections p JOIN projection_storage ps USING (projection_id) WHERE NOT p.has_statistics AND NOT p.is_aggregate_projection AND ps.row_count > 1;" | vsql -a@Jim_Knicely How frequently will the statistics become stale? I ran the select analyze_statistics() for once manually about 3 days ago, and i have more that 750M records loaded into db everyday but as of today i can still see 'statistics_type' column in 'projection_columns' table is still having value of 'Full'
Also read somewhere that running analyze_statistics() unnecessarily on the table increases catalog size, is that true?
The info 'FULL' for the
statistics_typemeans just one of the possible values: 'FULL','ROWCOUNT','NONE'. That also means that 'FULL' statistics can very well be stale.Imagine you have a
TIMESTAMPcolumn namedcreation_ts,DEFAULT CURRENT_TIMESTAMP. If you ranANALYZE_STATISTICS()on that column yesterday and today you inserted more rows - the statistics on that column will be full, but still, every query asking for today's data , say,WHERE creation_ts >= TRUNC(CURRENT_TIMESTAMP,'DD') AND creation_ts < TRUNC(CURRENT_TIMESTAMP,'DD') + 1will search for data for which no statistics have yet been gathered. For that type of query, the table/column's statistics would be too stale to allow for an efficient query plan, no matter what thestatistics_typefor that column may be.