How to deal with right-hand inserts
This is a follow-up question to the optimizer team's presentation at BDC.
Right-hand inserts happen when a table is continuously populated with ever-increasing values in some columns.
For example, a table is partitioned by date, and each day you add a new day's partition.
Or a column holds a time, and you are continuously loading data whose value is the current time.
The problem is that a right-hand insert never has a fixed upper limit on the value. If you collect statistics that record the min/max value for each column, then for time columns the result is outdated a minute later, as new data has already arrived.
My users mostly query today's data. The optimizer detects that the date is "value out of range" and predicts that only a few rows will be selected from the table, and this is for a table loading almost 100 billion rows per day.
As a result, the optimizer decides to broadcast the huge table and use it as the inner input of the join. That would be my choice for the worst possible execution plan.
I routinely find bad queries where the optimizer raises the "value out of range" warning and broadcasts a huge table.
It usually takes me 30 minutes to add hints and force the right execution plan. In most cases SQL performance improves by around two orders of magnitude after manual tuning. The key steps are preventing broadcasts of huge tables and preventing huge tables from being used as the inner input of a join.
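To give a flavor of the manual tuning, here is a rough sketch only: the table and column names are made up, and it assumes the join order written in the FROM clause is already the desired one, so the SYNTACTIC_JOIN and DISTRIB hints can pin it down.
-- Sketch: keep the huge fact table local and broadcast the small dimension,
-- instead of letting the optimizer broadcast the fact table as the inner input.
SELECT /*+SYNTACTIC_JOIN*/ d.region, COUNT(*)
FROM huge_fact f
JOIN /*+DISTRIB(L,B)*/ small_dim d ON d.dim_id = f.dim_id
WHERE f.event_time >= CURRENT_DATE
GROUP BY d.region;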
I tried raising the issue with Vertica support; their recommendation was to collect statistics more often. That is a very poor suggestion, as collecting statistics on several petabytes of data is very resource intensive and takes several hours. Also, petabyte-sized tables are practically unchanging, so collecting statistics on them again and again is a useless waste of resources. Collecting statistics once a year on humongous tables would be enough.
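Even narrowing the collection to the date column and a small sample, along the lines of the sketch below, does not really help: it is still expensive at this scale, and the max value is stale again within minutes (schema, table, and column names are placeholders).
-- Refresh statistics for just the date column, sampling 1 percent of the rows.
SELECT ANALYZE_STATISTICS('my_schema.huge_fact', 'event_date', 1);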
Question to the optimizer team: are there any plans to fix this issue, or to provide a workaround, config parameter, etc.?
Issue summary: tables with right-hand inserts never have valid min/max values on time/date columns, causing the optimizer to severely underestimate the row count for huge tables and produce very inefficient execution plans.
Answers
The simple answer is to keep the min/max up to date, but where that isn't possible, the suggestion that has worked best for me is to use only row count statistics (which are collected automatically). That way it is clear that it's a big table.
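A minimal sketch of what that can look like in practice, assuming the column-level histograms are discarded with DROP_STATISTICS so that only the automatically maintained row counts remain (the table name is a placeholder and the category argument is from memory, so verify it against the documentation):
-- Drop only the histograms for the huge table; row count statistics are retained.
SELECT DROP_STATISTICS('my_schema.huge_fact', 'HISTOGRAMS');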
There is another approach I'd look into. If you're just trying to avoid bad join orders, and don't want to handle each query separately, there's a FORCE OUTER table setting you can use to make the join order more consistent. See:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/AnalyzingData/Queries/Joins/ForceInnerOuterJoinInputs.htm
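A minimal sketch of the setting, assuming the documented comparison where the table with the higher force_outer value is preferred as the outer join input (the table name and the chosen value are placeholders):
-- Raise force_outer above the default so this table tends to become the outer input
-- and is not broadcast as the inner side of hash joins.
ALTER TABLE my_schema.huge_fact FORCE OUTER 8;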
The biggest problem is that the optimizer does produce optimal execution plans for data that falls inside the column min/max.
Both suggested solutions would work... but both are dumb.
Dropping statistics for huge tables is the last thing I would do. Most likely it would cause a significant performance degradation across the board.
Marking the top 20 largest tables as FORCE OUTER would be better than dropping statistics altogether. Still, my estimate is that around 1/3 of SQL statements hit data outside the value range. The other 2/3 hit older data, and the optimizer does produce good execution plans for them. And huge tables are not always the right choice as the outer input of a join.
Though marking FORCE OUTER will work in the majority of cases for huge tables, there will still be a sizeable number of outlier SQL statements that get optimized incorrectly.
I have hundreds of "smaller" tables that take right-hand inserts. "Smaller" does not necessarily mean they are small: in one database, the table ranked no. 21 by size is 2 TB; in another, table no. 21 is 4.3 TB (compressed disk size). Manually tuning all the FORCE OUTER values would be a tedious task.
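To at least automate picking the candidates, a query along these lines against v_monitor.projection_storage should rank tables by compressed size (a sketch; I have not double-checked every column name):
-- Approximate compressed size per anchor table, largest first.
SELECT anchor_table_schema, anchor_table_name, SUM(used_bytes) AS used_bytes
FROM v_monitor.projection_storage
GROUP BY anchor_table_schema, anchor_table_name
ORDER BY used_bytes DESC
LIMIT 20;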
And in my environment 2/3 of the large tables are loaded live, which means the value range is never valid for them.
I will implement FORCE OUTER, though it is not the optimal solution.
What I would prefer is to be able to mark a table column as "right-hand insert". That would cover all date partitioning columns, and all time/timestamp columns that record the current time. The optimizer should ignore "value out of range" for predicates on those columns.
That would allow me to provide the optimizer with additional information that it cannot deduce from table statistics.
BTW, this is how another leading RDBMS vendor addressed the problem.
You could try analyzing statistics by partition, using ANALYZE_STATISTICS_PARTITION. Collecting statistics by partition is less expensive than it is for the table.
This isn't a perfect solution. Collecting the stats is still manual, and partition stats do not roll up into table stats, nor can they be used by the optimizer when results span multiple partitions. In other words, partition stats are only used when query predicates select a single partition (e.g. "WHERE date_of_sale = '2020/04/02'" but not "WHERE date_of_sale BETWEEN '2020/04/01' AND '2020/04/02'").
https://www.vertica.com/docs/latest/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_STATISTICS_PARTITION.htm
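For example, refreshing statistics for just the newest daily partition might look like the following (the table name and partition key values are placeholders; the second and third arguments are the minimum and maximum partition keys):
-- Collect statistics only for the 2020-04-02 partition of the fact table.
SELECT ANALYZE_STATISTICS_PARTITION('my_schema.huge_fact', '2020-04-02', '2020-04-02');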
Thanks for the advice, but most likely analyzing by partition is not a usable approach for me.
The problem is that the optimizer is missing part of the information about the data in the table. The correct, long-term solution would be to let users give this information to the optimizer, for example:
select mark_column_right_hand_insert ( 'schema.table.column', true|false );
I will try FORCE OUTER, as it looks like the most usable solution, though there will definitely be some queries that suffer.