Optimizer - Wrong estimated # of rows

Hi,

I'm trying to improve a long-running query whose plan contains a BROADCAST.
I believe the optimizer decided to BROADCAST because its estimate for the preceding JOIN is wrong (it estimates 95K rows instead of 22M).

The join:
SELECT *
FROM dwh.very_big_fact_table v
INNER JOIN dwh.population_table p
   ON v.product_id = p.product_id
  AND v.event_date = p.event_date
WHERE v.event_date BETWEEN '2018-07-29' AND '2018-08-02';

dwh.population_table: unsegmented, order by in projection: event_date, product_id
dwh.very_big_fact_table: partitioned by year+week, segmented by column XXX , order by in projection: event_date, column YYY, column ZZZ

QUERY PLAN
Access Path:
+-JOIN HASH [Cost: 641K, Rows: 95K (NO STATISTICS)] (PATH ID: 1)
| Join Cond: (v.product_id = p.product_id) AND (v.event_date = p.event_date)
| Materialize at Output: v.user_id, v.event_id, v.event_name, v.app_event_id, v.install_time, v.event_time, v.install_date, v.event_week_day, v.adset_id, v.adset, v.os_id, v.os_name, v.os_version, v.idfv, v.app_id, v.dim_google_campaign_id, v.campaign_id, v.dim_campaign_name, v.channel, v.device_type, v.campaign_name, v.idfa, v.keywords, v.sub_param_1, v.sub_param_2, v.sub_param_3, v.sub_param_4, v.sub_param_5, v.ad_id, v.creative, v.ad_type, v.event_revenue_usd, v.event_value, v.android_id, v.media_source_id, v.media_source, v.appsflyer_id, v.country_id, v.country, v.agency_id, v.partner, v.adset_name, v.site_id, v.sub_site_id, v.invoke_source_id, v.udid, v.source_user_id, v.source_user_sn_id, v.destination_user_id, v.studio_platform_id, v.campaign_key, v.is_attached, v.original_udid, v.original_source_user_id, v.original_source_user_sn_id, v.original_destination_user_id, v.original_install_date, v.sys_loading_date, v.campaign_nkey
| Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for v [Cost: 72K, Rows: 22M] (PATH ID: 2)
| | Projection: dwh.very_big_fact_table_b1
| | Materialize: v.product_id, v.event_date
| | Filter: ((v.event_date >= '2018-07-29'::date) AND (v.event_date <= '2018-08-02'::date))
| | Filter: (v.event_id = ANY (ARRAY[1, 2, 3, 5]))
| | Execute on: All Nodes
| | Runtime Filters: (SIP1(HashJoin): v.product_id), (SIP2(HashJoin): v.event_date), (SIP3(HashJoin): v.product_id, v.event_date)
| +-- Inner -> STORAGE ACCESS for p [Cost: 52, Rows: 27 (NO STATISTICS)] (PATH ID: 3)
| | Projection: dwh.population_table
| | Materialize: p.event_date, p.product_id, p.sys_loading_date
| | Filter: ((p.event_date >= '2018-07-29'::date) AND (p.event_date <= '2018-08-02'::date))
| | Execute on: All Nodes

The WHERE clause is generated dynamically in the ETL process, based on the population table's data, to "help" the optimizer filter the fact table down to the relevant dates.
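
For context, here is a minimal sketch of that ETL step (a hypothetical query; only the table and column names come from this post):

SELECT MIN(event_date) AS from_date, MAX(event_date) AS to_date
FROM dwh.population_table;
-- the result is spliced into the fact query, producing e.g.:
-- WHERE v.event_date BETWEEN '2018-07-29' AND '2018-08-02'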

So because the optimizer "misses" on the join's resulting row count, it performs a BROADCAST afterwards.

I tried a couple of things to work around this (the first two are sketched below), including:
1. Partitioned the population table like the fact table (year+week).
2. Changed the ORDER BY of the fact table projection to align with the join columns.
3. Removed the WHERE clause.
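
A rough sketch of the first two attempts (illustrative DDL; the real column lists are longer, and segmenting the new projection by the join key is my choice here, not a given):

-- 1. Partition the population table by year+week, like the fact table:
ALTER TABLE dwh.population_table
  PARTITION BY YEAR(event_date) * 100 + WEEK(event_date) REORGANIZE;

-- 2. Fact-table projection ordered by the join columns:
CREATE PROJECTION dwh.very_big_fact_table_join AS
  SELECT * FROM dwh.very_big_fact_table
  ORDER BY event_date, product_id
  SEGMENTED BY HASH(product_id) ALL NODES;
SELECT START_REFRESH();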

Nothing helped.
It's not clear how the optimizer arrives at this estimate, or how to overcome the bad estimation.

Thanks!

Comments

  • Jim_Knicely - Administrator

    Notice the "NO STATISTICS" hint in the explain plan. Do you have stats on the dwh.population_table?

    SELECT analyze_statistics('dwh.population_table');

    See:
    https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Statistics/ReactingToStaleStatistics.htm
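
    To double-check which columns still lack statistics, you can query v_catalog.projection_columns (a statistics_type of NONE means no stats), with something like:

    SELECT projection_name, table_column_name, statistics_type
    FROM v_catalog.projection_columns
    WHERE table_schema = 'dwh'
      AND table_name = 'population_table';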

  • edited August 2018

    I tried that too; the NO STATISTICS was removed from the STORAGE ACCESS of dwh.population_table.

    The estimated number of rows remained the same.

  • Jim_Knicely - Administrator

    Hi,

    Note that the "Rows" count is the number of rows that are actually involved in the join. That number can be lower than the table's row count.

    Example:

    dbadmin=> select * from table1;
     c1
    ----
      1
      2
      3
    (3 rows)
    
    dbadmin=> select * from table2;
     c1
    ----
      1
      1
      5
    (3 rows)
    
    dbadmin=> explain select t1.c1, t2.c1 from table1 t1 join table2 t2 on t2.c1 = t1.c1 where t2.c1 = 1 order by t1.c1;
                                         QUERY PLAN
    --------------------------------------------------------------------------
     ------------------------------
     QUERY PLAN DESCRIPTION:
     ------------------------------
    
     explain select t1.c1, t2.c1 from table1 t1 join table2 t2 on t2.c1 = t1.c1 where t2.c1 = 1 order by t1.c1;
    
     Access Path:
     +-JOIN MERGEJOIN(inputs presorted) [Cost: 100, Rows: 2] (PATH ID: 2)
     |  Join Cond: (t2.c1 = t1.c1)
     | +-- Outer -> STORAGE ACCESS for t1 [Cost: 7, Rows: 1] (PATH ID: 3)
     | |      Projection: public.table1_super
     | |      Materialize: t1.c1
     | |      Filter: (t1.c1 = 1)
     | |      Runtime Filter: (SIP1(MergeJoin): t1.c1)
     | +-- Inner -> STORAGE ACCESS for t2 [Cost: 92, Rows: 2] (PATH ID: 4)
     | |      Projection: public.table2_super
     | |      Materialize: t2.c1
     | |      Filter: (t2.c1 = 1)
     ------------------------------
    

    In the explain plan, table1 has 3 rows, but its STORAGE ACCESS shows only 1 row (only one row passes the c1 = 1 filter), and table2 has 3 rows, but the plan shows only 2 (two rows pass the filter).

    See:
    http://www.vertica-forums.com/viewtopic.php?t=2354

  • Yeah, that makes sense; it's the same as in my case. The entire table has 2B rows, but with the filtering it's reduced to 22M. What doesn't make sense is the estimated row count after the join.
  • Jim_Knicely - Administrator

    Where are you seeing estimated rows?

  • For the fact table 22M from the explain:

    | +-- Outer -> STORAGE ACCESS for v [Cost: 72K, Rows: 22M] (PATH ID: 2)
  • Jim_Knicely - Administrator

    That 22M is from the result of the join, with all of the join optimizations. Why is that incorrect?

  • As I understand it, the result of the join is described here:

    Path:
    +-JOIN HASH [Cost: 641K, Rows: 95K (NO STATISTICS)] (PATH ID: 1)

    The output of the join is 95K rows...
  • Jim_Knicely - Administrator
    edited August 2018

    That row count is just an estimate. When you generate an explain plan, Vertica does not execute the query, so you won't get the exact row count. If you have good stats, the estimated rows should be close to the actual row count.

    Example:

    dbadmin=> SELECT COUNT(*) FROM inner;
     COUNT
    -------
         4
    (1 row)
    
    dbadmin=> SELECT COUNT(*) FROM outer;
      COUNT
    ---------
     4194304
    (1 row)
    
    dbadmin=> SELECT COUNT(*) FROM inner join outer ON outer.c2 = inner.c1 AND outer.c1 = inner.c2;
     COUNT
    --------
     504292
    (1 row)
    

    The explain plan shows an estimated row count of 671K for the HASH join, versus the actual 504,292.

    dbadmin=> EXPLAIN SELECT COUNT(*) FROM inner join outer ON outer.c2 = inner.c1 AND outer.c1 = inner.c2;
                                         QUERY PLAN
    --------------------------------------------------------------------------
     ------------------------------
     QUERY PLAN DESCRIPTION:
     ------------------------------
    
     EXPLAIN SELECT COUNT(*) FROM inner join outer ON outer.c2 = inner.c1 AND outer.c1 = inner.c2;
    
     Access Path:
     +-GROUPBY NOTHING [Cost: 204K, Rows: 1] (PATH ID: 1)
     |  Aggregates: count(*)
     | +---> JOIN HASH [Cost: 203K, Rows: 671K] (PATH ID: 2)
     | |      Join Cond: ("outer".c2 = "inner".c1) AND ("outer".c1 = "inner".c2)
     | | +-- Outer -> STORAGE ACCESS for outer [Cost: 200K, Rows: 4M] (PATH ID: 3)
     | | |      Projection: public.outer_super
     | | |      Materialize: "outer".c2, "outer".c1
     | | |      Runtime Filters: (SIP1(HashJoin): "outer".c2), (SIP2(HashJoin): "outer".c1), (SIP3(HashJoin): "outer".c2, "outer".c1)
     | | +-- Inner -> STORAGE ACCESS for inner [Cost: 66, Rows: 4] (PUSHED GROUPING) (PATH ID: 4)
     | | |      Projection: public.inner_super
     | | |      Materialize: "inner".c1, "inner".c2
     ------------------------------
    
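    To compare those estimates with the actual row counts, one option (a sketch; the transaction_id below is a placeholder you'd copy from the hint that PROFILE prints) is to run the query with PROFILE and read the per-path "rows produced" counters from v_monitor.execution_engine_profiles:

    PROFILE SELECT COUNT(*) FROM inner JOIN outer ON outer.c2 = inner.c1 AND outer.c1 = inner.c2;

    SELECT path_id, operator_name, SUM(counter_value) AS actual_rows
    FROM v_monitor.execution_engine_profiles
    WHERE counter_name = 'rows produced'
      AND transaction_id = 45035996273705000  -- placeholder from the PROFILE hint
      AND statement_id = 1
    GROUP BY path_id, operator_name
    ORDER BY path_id;
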
  • Thanks Jim. Does the optimizer use the estimated rows to decide the next steps in the plan? That's my point: if it "misses" on the hash join estimate, its decisions (for example, which table will be inner or outer) might be wrong...

    In the end, regarding my original query that caused the issues, I handled it by forcing the big table to be the outer input (sketched below). The query ran in a reasonable time and didn't throw a memory exception.
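
    A rough sketch of what that looks like (using the SYNTACTIC_JOIN hint, which keeps the join order as written so that the table written first becomes the outer input; see the docs link below):

    SELECT /*+SYNTACTIC_JOIN*/ *
    FROM dwh.very_big_fact_table v      -- written first => outer input
    INNER JOIN dwh.population_table p   -- inner (hash-table) side
       ON v.product_id = p.product_id
      AND v.event_date = p.event_date
    WHERE v.event_date BETWEEN '2018-07-29' AND '2018-08-02';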
  • Jim_Knicely - Administrator

    If you have valid stats on the tables, Vertica should pick the correct INNER table in the HASH join. However, according to the docs, "Occasionally, the optimizer might choose the larger table as the inner input to a join." You took the right approach by controlling the join inputs manually.

    See:
    https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/AnalyzingData/Queries/Joins/ForceInnerOuterJoinInputs.htm
