Watch Out for Data Induced “Cross” Joins
I recently worked with a client that reported that a query in Vertica would not return data. That is, the query ran forever. The query was pretty simple, joining two 100 billion+ tables together. After some investigation we realized that the query could potential return trillions and trillions of records! Why? Because the join keys actually performed a COSS JOIN. In this case it turned out to be bad data in one of the tables. I call this a data induced cross join, but it was actually just an unexpected many to many join. The point is, be careful with your join conditions!
Example:
dbadmin=> CREATE TABLE t1 (c INT); CREATE TABLE dbadmin=> CREATE TABLE t2 (c INT); CREATE TABLE dbadmin=> INSERT INTO t1 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO t1 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO t2 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO t2 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> SELECT t1.*, t2.* FROM t1 JOIN t2 USING (c); c | c ---+--- 1 | 1 1 | 1 1 | 1 1 | 1 (4 rows) dbadmin=> SELECT t1.*, t2.* FROM t1 CROSS JOIN t2; c | c ---+--- 1 | 1 1 | 1 1 | 1 1 | 1 (4 rows)
See how we are getting a cross join from the inner join? Imagine the result set of joining two 100+ billion records tables like this!
Helpful Link:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Queries/Joins/CrossJoins.htm
Have fun!