Watch Out for Data Induced “Cross” Joins

Jim_KnicelyJim_Knicely - Select Field - Administrator

I recently worked with a client that reported that a query in Vertica would not return data. That is, the query ran forever. The query was pretty simple, joining two 100 billion+ tables together. After some investigation we realized that the query could potential return trillions and trillions of records! Why? Because the join keys actually performed a COSS JOIN. In this case it turned out to be bad data in one of the tables. I call this a data induced cross join, but it was actually just an unexpected many to many join. The point is, be careful with your join conditions!

Example:

dbadmin=> CREATE TABLE t1 (c INT);
CREATE TABLE

dbadmin=> CREATE TABLE t2 (c INT);
CREATE TABLE

dbadmin=> INSERT INTO t1 SELECT 1;
 OUTPUT
--------
   1
(1 row)

dbadmin=> INSERT INTO t1 SELECT 1;
 OUTPUT
--------
   1
(1 row)

dbadmin=> INSERT INTO t2 SELECT 1;
 OUTPUT
--------
   1
(1 row)

dbadmin=> INSERT INTO t2 SELECT 1;
 OUTPUT
--------
   1
(1 row)

dbadmin=> SELECT t1.*, t2.* FROM t1 JOIN t2 USING (c);
 c | c
---+---
 1 | 1
 1 | 1
 1 | 1
 1 | 1
(4 rows)

dbadmin=> SELECT t1.*, t2.* FROM t1 CROSS JOIN t2;
 c | c
---+---
 1 | 1
 1 | 1
 1 | 1
 1 | 1
(4 rows)

See how we are getting a cross join from the inner join? Imagine the result set of joining two 100+ billion records tables like this!

Helpful Link:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Queries/Joins/CrossJoins.htm

Have fun!

Sign In or Register to comment.