Avoid Left Joins

Vertica_User · December 2017

We load .tar files (has 1000's gz files) to vertica staging tables. From staging table, transformations are applied (by left joining several tables) and then loading to Vertica tables.

Sample Query,
Insert into Vertica_Table (col1, col2, col3, col4, col5, col6, col7, col8, col9, col10)
select
STG.col1, TAB2.col2, TAB3.col3,..........................., TAB10.col10
From
Staging_Table STG Left Outer Join
Table_1 TAB1 on (join conditions) Left Outer Join
Table_2 TAB2 on (Join conditions) Left Outer Join
.
.
.
Table_3 TAB3 on (join conditions)

We are using left joins to do transformations / mapping. This query runs for ever. STG (is a huge table 1TB data).

Please let me know if there is any alternate approach to improve the performance. Or can this transformation be done by looping every record in STG (using seperate procedure / UDL etc).

Jim_Knicely · December 2017

Hi,

How big are Table_1, Table_2 ...Table_x compared to the Staging_Table? I'm guessing that the Staging_Table is huge while the other tables are relatively small.

Are you joining Table_1, Table_2 ... Table_x back to Staging_Table? I'm guess you do.

1: Make sure you've defined PKs and FKs
2: Make sure you have up to date stats
3: Make sure the projections for the tables are tuned for your use case

Per 3:

The Staging_Table should be segmented by col1, col2, col3, col4, col5, col6, col7, col8, col9, col10 and Table_1, Table_2, Table_x should all be replicated.

Order the join columns in the Staging_Table table from the lowest to highest carnality and order the Table_1, Table_2 ... Table_x tables by the join column.

Database designer can do that for you...

If you don't want the LEFT joins, you can try this:

Insert into Vertica_Table (stg_col1, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10)
select STG.col1,
(select TAB1.col1 from TAB1 where TABL1.pk = STG.col1fk_1),
(select TAB2.col2 from TAB2 where TABL2.pk = STG.col2fk_2),
(select TAB3.col2 from TAB3 where TABL3.pk = STG.col3fk_3),
...
(select TABx.col2 from TABx where TABLx.pk = STG.colxfk_x)
from Staging_Table STG ;

Vertica_User · December 2017

Hi Jim.
Thanks for your reply.
Please find my answers below,
1. Staging table is the huge one and rest are comparatively smaller.
2. Are you joining Table_1, Table_2 ... Table_x back to Staging_Table? Yes

There is no PK / FK in any of the columns. Will try to alter the table based on that.

Regarding stats-
we are loading the table 4 times on a daily basis. Can I include ANALYZE_STATISTICS for the table in the script after the inset statement. Can you please let me know the best practice.

Do we need to update stats in all the other tables too - like staging, look up table etc.

Jim_Knicely · December 2017

What version of Vertica are you using?

About PKs: Having enabled key constraints, particularly on primary keys, can help the optimizer produce faster query plans, particularly for joins. When a table has an enabled primary key constraint, the optimizer can assume that it has no rows with duplicate values across the key set.

See:
https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Constraints/DecidingWhethertoEnablePRIMARYandUNIQUEKEYConstraints.htm

I would defiantly add the analyze statistics for the big table after the insert.

However, make sure to follow the "Best Practices for Statistics Collection" guidelines here:

https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Statistics/BestPracticesForStatisticsCollection.htm

You probably won't need to analyze the lookup tables as often as their data probably doesn't change that much.