How to speed up multiple join query

HyeontaeJu · August 2020

The Query is this
SELECT
*
FROM
products a
LEFT OUTER JOIN products_image b ON
a.product_id = b.product_id
LEFT OUTER JOIN ws_product_brand_map c ON
a.product_id = c.product_id
LEFT OUTER JOIN ws_product_category_map d ON
a.product_id = d.product_id
LEFT OUTER JOIN ws_category e ON
d.cate_mst_cd = e.master_code

and query plan is this
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 17M, Rows: 44M] (PATH ID: 1)
| Join Cond: (a.product_id = d.product_id)
| +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4M, Rows: 44M] (PATH ID: 2)
| | Join Cond: (a.product_id = b.product_id)
| | Materialize at Input: a.product_id
| | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 3M, Rows: 44M] (PATH ID: 3)
| | | Join Cond: (a.product_id = c.product_id)
| | | +-- Outer -> STORAGE ACCESS for a [Cost: 460K, Rows: 44M] (PATH ID: 4)
| | | | Projection: newbiz_product_dev.products_super
| | | | Materialize: a.product_id
| | | +-- Inner -> STORAGE ACCESS for c [Cost: 2M, Rows: 44M] (PATH ID: 5)
| | | | Projection: newbiz_product_dev.ws_product_brand_map_super
| | | | Materialize: c.product_id, c.mall_id, c.pipe_id, c.brand_id, c.domain_brand, c.brand, c.brand_status, c.brand_ctrl_disp_yn, c.reg_user, c.reg_dt, c.mod_user, c.mod_dt
| | +-- Inner -> STORAGE ACCESS for b [Cost: 474K, Rows: 4M] (PATH ID: 6)
| | | Projection: newbiz_product_dev.products_image_super
| | | Materialize: b.product_id, b.mall_id, b.pipe_id, b.image_link, b.cdn_image_path, b.cdn_image_name, b.cdn_image_name_orig, b.image_status, b.clustered_yn, b.reg_images, b.file_checksum, b.collect_time, b.registerd_time, b.update_time
| +-- Inner -> JOIN HASH [LeftOuter] [Cost: 65K, Rows: 32M] (PATH ID: 7)
| | Join Cond: (d.cate_mst_cd = e.master_code)
| | +-- Outer -> STORAGE ACCESS for d [Cost: 40K, Rows: 32M] (PATH ID: 8)
| | | Projection: newbiz_product_dev.ws_product_category_map_super
| | | Materialize: d.cate_mst_cd
| | +-- Inner -> STORAGE ACCESS for e [Cost: 482, Rows: 5K] (PATH ID: 9)
| | | Projection: newbiz_product_dev.ws_category_super
| | | Materialize: e.master_code, e."1depth_code", e."1depth_name", e."2depth_code", e."2depth_name", e."3depth_code", e."3depth_name", e."4depth_code", e."4depth_name", e."5depth_code", e."5depth_name", e.category_text, e.use_yn, e.mod_date

plz help me

Navin_C · August 2020

Based on the query and plan (along with statistics)
Some generic optimization rules .
1. Always reduce the data before any joins as much possible.
2. When joining, make sure smaller tables are on the left side of join syntax, which makes this data set to be in memory / broadcasted to all the vertica nodes and makes join faster.
3. Join on INT columns, preferred over any other types, it makes it faster.
4. Select on the columns which are really needed. -- Reduce materializing of columns
5. Make sure stats are updated on the table.

In this case, one of your joins is assuming the huge table in the INNER part of join, which can ask for more memory and reduce the execution speed.
-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 3M, Rows: 44M] (PATH ID: 3)
| | | Join Cond: (a.product_id = c.product_id)
| | | +-- Outer -> STORAGE ACCESS for a [Cost: 460K, Rows: 44M] (PATH ID: 4)
| | | | Projection: newbiz_product_dev.products_super
| | | | Materialize: a.product_id
| | | +-- Inner -> STORAGE ACCESS for c [Cost: 2M, Rows: 44M] (PATH ID: 5)
| | | | Projection: newbiz_product_dev.ws_product_brand_map_super
| | | | Materialize: c.product_id, c.mall_id, c.pipe_id, c.brand_id, c.domain_brand, c.brand, c.brand_status, c.brand_ctrl_disp_yn, c.reg_user, c.reg_dt, c.mod_user, c.mod_dt

HyeontaeJu · August 2020

@Navin_C Thank you very much,,, all table's data is very big,,
products : 40,000,000
ws_product_brand_map : 40,000,000
ws_product_category_map : 40,000,000
ws_category = 4,000
so,, In this case How can i reduce the response time ??

Sankarmn · August 2020

@HyeontaeJu , Optimal performance depends in great part on the projections that are available for a given query. Check best practices for joins, https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/OptimizingJOINQueries.htm?tocpath=Analyzing Data|Query Optimization|JOIN Queries|_____0

marcothesane · August 2020

You have a great deal of MERGEJOIN operators - this is very good and can't be beaten.

None of the JOIN operators show RESEGMENT or BROADCAST or ROUND-ROBIN, so you have segmented your table's data well.

But:
What are you running your query with?
According to what the statistics say, you are fetching 44 million rows from Vertica to the client.

To avoid the data fetching process, just try:

CREATE LOCAL TEMPORARY TABLE dropme
ON COMMIT PRESERVE ROWS AS
-- your full query
SELECT
*
FROM
products a
LEFT OUTER JOIN products_image b ON
a.product_id = b.product_id
LEFT OUTER JOIN ws_product_brand_map c ON
a.product_id = c.product_id
LEFT OUTER JOIN ws_product_category_map d ON
a.product_id = d.product_id
LEFT OUTER JOIN ws_category e ON
d.cate_mst_cd = e.master_code
-- your full query ends here
KSAFE 0
;

.. and measure the time when the data does not have to be prepared for sending to client, and the client has no data to fetch.

You seriously risk to test the network speed rather than the database.

We're Moving!

Create My New Community Account Now

How to speed up multiple join query

Answers

Leave a Comment