Optimizer - Wrong estimated # of rows
Hi,
I'm trying to improve a long-running query whose plan contains a BROADCAST.
I believe the optimizer chose the BROADCAST because its estimate for the preceding JOIN is wrong (it estimates 95K rows instead of 22M).
The join:
SELECT *
FROM dwh.very_big_fact_table v
INNER JOIN dwh.population_table AS p
    ON v.product_id = p.product_id AND v.event_date = p.event_date
WHERE v.event_date BETWEEN '2018-07-29' AND '2018-08-02';
dwh.population_table: unsegmented; projection ORDER BY: event_date, product_id
dwh.very_big_fact_table: partitioned by year+week, segmented by column XXX; projection ORDER BY: event_date, column YYY, column ZZZ
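For reference, a minimal sketch of what the two projections described above might look like as DDL (XXX, YYY, ZZZ are the placeholder column names from this post, not real columns):

-- Sketch only, not the actual DDL.
CREATE PROJECTION dwh.population_table_super AS
SELECT * FROM dwh.population_table
ORDER BY event_date, product_id
UNSEGMENTED ALL NODES;

-- The year+week partitioning lives on the table itself, not on the projection.
CREATE PROJECTION dwh.very_big_fact_table_b1 AS
SELECT * FROM dwh.very_big_fact_table
ORDER BY event_date, YYY, ZZZ
SEGMENTED BY HASH(XXX) ALL NODES;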
QUERY PLAN
Access Path:
+-JOIN HASH [Cost: 641K, Rows: 95K (NO STATISTICS)] (PATH ID: 1)
| Join Cond: (v.product_id = p.product_id) AND (v.event_date = p.event_date)
| Materialize at Output: v.user_id, v.event_id, v.event_name, v.app_event_id, v.install_time, v.event_time, v.install_date, v.event_week_day, v.adset_id, v.adset, v.os_id, v.os_name, v.os_version, v.idfv, v.app_id, v.dim_google_campaign_id, v.campaign_id, v.dim_campaign_name, v.channel, v.device_type, v.campaign_name, v.idfa, v.keywords, v.sub_param_1, v.sub_param_2, v.sub_param_3, v.sub_param_4, v.sub_param_5, v.ad_id, v.creative, v.ad_type, v.event_revenue_usd, v.event_value, v.android_id, v.media_source_id, v.media_source, v.appsflyer_id, v.country_id, v.country, v.agency_id, v.partner, v.adset_name, v.site_id, v.sub_site_id, v.invoke_source_id, v.udid, v.source_user_id, v.source_user_sn_id, v.destination_user_id, v.studio_platform_id, v.campaign_key, v.is_attached, v.original_udid, v.original_source_user_id, v.original_source_user_sn_id, v.original_destination_user_id, v.original_install_date, v.sys_loading_date, v.campaign_nkey
| Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for v [Cost: 72K, Rows: 22M] (PATH ID: 2)
| | Projection: dwh.very_big_fact_table_b1
| | Materialize: v.product_id, v.event_date
| | Filter: ((v.event_date >= '2018-07-29'::date) AND (v.event_date <= '2018-08-02'::date))
| | Filter: (v.event_id = ANY (ARRAY[1, 2, 3, 5]))
| | Execute on: All Nodes
| | Runtime Filters: (SIP1(HashJoin): v.product_id), (SIP2(HashJoin): v.event_date), (SIP3(HashJoin): v.product_id, v.event_date)
| +-- Inner -> STORAGE ACCESS for p [Cost: 52, Rows: 27 (NO STATISTICS)] (PATH ID: 3)
| | Projection: dwh.population_table
| | Materialize: p.event_date, dailyp_run.product_id, p.sys_loading_date
| | Filter: ((p.event_date >= '2018-07-29'::date) AND (p.event_date <= '2018-08-02'::date))
| | Execute on: All Nodes
The WHERE clause is generated dynamically in the ETL process, according to the population table's data, to "help" the optimizer filter the data to the relevant dates.
So because the optimizer "misses" on the join's resulting row count, it performs a BROADCAST afterwards.
I tried a couple of things to work around this (a sketch follows the list), including:
1. Partitioned the population table like the fact table (year+week).
2. Changed the ORDER BY of the fact table projection to align with the join columns.
3. Removed the WHERE clause.
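Roughly, attempts (1) and (2) looked like this (a sketch with simplified names; the projection name and ORDER BY choice here are illustrative):

-- (1) Repartition the population table like the fact table.
ALTER TABLE dwh.population_table
PARTITION BY EXTRACT(YEAR FROM event_date) * 100 + EXTRACT(WEEK FROM event_date) REORGANIZE;

-- (2) A fact projection ordered on the join columns.
CREATE PROJECTION dwh.very_big_fact_table_join AS
SELECT * FROM dwh.very_big_fact_table
ORDER BY product_id, event_date
SEGMENTED BY HASH(product_id) ALL NODES;
SELECT REFRESH('dwh.very_big_fact_table');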
Nothing helped.
It's not clear how the optimizer arrives at this estimate, or how to overcome the bad estimation.
Thanks!
Comments
Notice the "NO STATISTICS" hint in the explain plan. Do you have stats on the dwh.population_table?
SELECT analyze_statistics('dwh.population_table');
See:
https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Statistics/ReactingToStaleStatistics.htm
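You can also confirm in the catalog what kind of statistics exist per column (a quick check, assuming your version exposes statistics_type in v_catalog.projection_columns; it should read FULL after analyze_statistics):

SELECT projection_name, table_column_name, statistics_type
FROM v_catalog.projection_columns
WHERE table_schema = 'dwh' AND table_name = 'population_table';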
I tried that too; the NO STATISTICS hint disappeared from the STORAGE ACCESS of dwh.population_table.
The estimated number of rows remained the same.
Hi,
Note that the "Rows" count is the number of rows that are actually involved in the join. That number can be lower than the table's row count.
Example:
In the explain plan, the table "table1" has 3 rows, but the STORAGE ACCESS rows only show 1; the table "table2" has 3 rows, but the plan only shows 2 rows.
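A small repro along those lines (hypothetical throwaway tables, not from the original post):

CREATE TABLE table1 (id INT);
CREATE TABLE table2 (id INT);
INSERT INTO table1 VALUES (1);
INSERT INTO table1 VALUES (2);
INSERT INTO table1 VALUES (3);
INSERT INTO table2 VALUES (2);
INSERT INTO table2 VALUES (3);
INSERT INTO table2 VALUES (4);
COMMIT;
SELECT analyze_statistics('table1');
SELECT analyze_statistics('table2');
-- Each table has 3 rows, but only ids 2 and 3 can match, so the plan's
-- Rows counts reflect join participation rather than table size.
EXPLAIN SELECT * FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id;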
See:
http://www.vertica-forums.com/viewtopic.php?t=2354
Where are you seeing estimated rows?
| +-- Outer -> STORAGE ACCESS for v [Cost: 72K, Rows: 22M] (PATH ID: 2)
That 22M is from the result of the join, with all of the join optimizations. Why is that incorrect?
Path:
+-JOIN HASH [Cost: 641K, Rows: 95K (NO STATISTICS)] (PATH ID: 1)
The output of the join is 95K rows...
That row count is just an estimate. When you generate an explain plan, Vertica is not executing the query, so you won't get the exact row count. If you have good stats, the estimated rows should be close to the actual row count.
Example:
The explain plan shows an estimated row count of 671K for the HASH join.
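One way to compare the estimate with the actual count is to PROFILE the query and read the per-path counters (a sketch; in practice also filter on the transaction_id/statement_id that PROFILE prints):

PROFILE SELECT * FROM dwh.very_big_fact_table v
INNER JOIN dwh.population_table p
    ON v.product_id = p.product_id AND v.event_date = p.event_date
WHERE v.event_date BETWEEN '2018-07-29' AND '2018-08-02';

-- Actual rows produced per plan path, to set against the EXPLAIN estimates.
SELECT path_id, SUM(counter_value) AS actual_rows
FROM v_monitor.execution_engine_profiles
WHERE counter_name = 'rows produced'
GROUP BY path_id
ORDER BY path_id;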
In the end, regarding my original query that caused the issues, I handled it by forcing the big table to be the outer input of the join. The query ran in reasonable time and didn't throw a memory exception.
If you have valid stats on the tables, Vertica should pick the correct INNER table in the HASH join. However, according to the docs, "Occasionally, the optimizer might choose the larger table as the inner input to a join." You took the right approach to "control the join inputs" manually.
See:
https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/AnalyzingData/Queries/Joins/ForceInnerOuterJoinInputs.htm
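For completeness, the "force the outer" approach can be written with Vertica's SYNTACTIC_JOIN hint, which makes the optimizer join tables in the order the query lists them (left input as outer). A sketch against the original query:

SELECT /*+SYNTACTIC_JOIN*/ *
FROM dwh.very_big_fact_table v
INNER JOIN dwh.population_table p
    ON v.product_id = p.product_id AND v.event_date = p.event_date
WHERE v.event_date BETWEEN '2018-07-29' AND '2018-08-02';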