UNION ALL and count(*) performance
I have two tables with identical structure:
table1_1
table1_2
and a view1:
select * from table1_1 union all select * from table1_2
select count(1) from table1_1
returns in under a second
select count(1) from table1_2
returns in under a second as well
But
select count(1) from view1
just gets stuck. How is this possible?
0
This discussion has been closed.
Answers
That is very possible.
See query plan, by running query with prepended EXPLAIN.
Learn how to read and understand query plan.
Eventually you will get an idea what is going on.
Sadly, I had done that before and nothing explained me what happened. Maybe if you could give more concrete an idea (an example) of what exactly in the query plan could have caused it - it would be more substantial. There are no JOIN's or predicates either way, the projections used are the same and the GROUP BY is HASH in both cases. I am more inclined to think that it has to do with memory allocation.
Well, I ran a test on a one-million-row table.
These are the first 10 rows:
I made a copy of that table) using
COPY_TABLE():SELECT COPY_TABLE('dbadmin.one_million_rows','dbadmin.another_million_rows'); -- out COPY_TABLE -- out -------------------------------------------------------------------------------------------------------------------- -- out Created table dbadmin.another_million_rows. -- out Copied table dbadmin.one_million_rows to dbadmin.another_million_rowsI created a view combining the two:
Btw. I'm working on a one-noder, hence no hints as to which nodes the operators run on.
Then,
1. I explain the
SELECTfrom one of the tables (I don't need to repeat on the other, as it is identical thanks toCOPY_TABLE()):Note that for the
SELECT COUNT(1)on a table, nothing needs materialising, as the row count is usually stored in the table statistics and is picked from there, resulting in an overall cost efficient of 3.Note further that in case of a
UNION ALL SELECT, it is too risky to pick the row count from the statistics, so the primary key column is materialised on both branches of theUNION ALL SELECT, each resulting in a cost coefficient of 2000; theUNION ALLitself adds another 1000 to the cost coefficient, resulting in 5000, and the top finalGROUPBY NOTHINGcosts something negligible, not modifying the value of 5000.So, it's no wonder that I get the following run times:
well, in my case one of the tables was blank (not sure if it mattered but I thought I needed to mention it) and the difference was not like in your case - 16 vs 65ms, but it was like less than a second versus forever (didn't bother to wait it out, took more than a couple of mins), which is over a 100 times difference.
Well, I truncated my
another_million_rowsand re-explained and re-ran. Same plan, and 2/3 of the original run time, which seems reasonable.I could imagine that, depending on projection layout, on existence of primary key or not, on not exactly matching, but only compatible data types in the columns - and with a greater number of columns and a higher row count - it could worsen dramatically.
I suggest you run an explain and share the output here. Just the block titled
Access Path:of the output - and we could find more hints as to the heaviness of the query.Remember, a
UNION SELECTis a query on a set operation ("Union") of two materialised query result sets. At least Vertica is clever enough to just materialise the primary key columns instead of all columns of both tables. Any late materialisation shortcut that Vertica could take is made impossible by that.