We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


UNION ALL and count(*) performance — Vertica Forum

UNION ALL and count(*) performance

dimitri_pdimitri_p
edited April 23 in General Discussion

I have two tables with identical structure:
table1_1
table1_2

and a view1:

select * from table1_1 union all select * from table1_2

select count(1) from table1_1

returns in under a second

select count(1) from table1_2

returns in under a second as well

But

select count(1) from view1

just gets stuck. How is this possible?

Answers

  • That is very possible.
    See query plan, by running query with prepended EXPLAIN.
    Learn how to read and understand query plan.
    Eventually you will get an idea what is going on.

  • dimitri_pdimitri_p
    edited May 2

    Sadly, I had done that before and nothing explained me what happened. Maybe if you could give more concrete an idea (an example) of what exactly in the query plan could have caused it - it would be more substantial. There are no JOIN's or predicates either way, the projections used are the same and the GROUP BY is HASH in both cases. I am more inclined to think that it has to do with memory allocation.

  • marcothesanemarcothesane - Select Field - Administrator

    Well, I ran a test on a one-million-row table.

    These are the first 10 rows:

    SELECT * FROM dbadmin.one_million_rows LIMIT 10;
    -- out  id | id_desc |    dob     | category |        busid         | revenue 
    -- out ----+---------+------------+----------+----------------------+---------
    -- out   0 |       0 | 1950-01-01 |        1 | ====== boss ======== |   0.000
    -- out   1 |      -1 | 1950-01-02 |        2 | kbv-000001kbv-000001 |   0.010 
    -- out   2 |      -2 | 1950-01-03 |        3 | kbv-000002kbv-000002 |   0.020
    -- out   3 |      -3 | 1950-01-04 |        4 | kbv-000003kbv-000003 |   0.030
    -- out   4 |      -4 | 1950-01-05 |        5 | kbv-000004kbv-000004 |   0.040
    -- out   5 |      -5 | 1950-01-06 |        1 | ====== boss ======== |   0.050
    -- out   6 |      -6 | 1950-01-07 |        2 | kbv-000006kbv-000006 |   0.060
    -- out   7 |      -7 | 1950-01-08 |        3 | kbv-000007kbv-000007 |   0.070
    -- out   8 |      -8 | 1950-01-09 |        4 | kbv-000008kbv-000008 |   0.080
    -- out   9 |      -9 | 1950-01-10 |        5 | kbv-000009kbv-000009 |   0.090
    

    I made a copy of that table) using COPY_TABLE():

    SELECT COPY_TABLE('dbadmin.one_million_rows','dbadmin.another_million_rows');
    -- out                                                      COPY_TABLE                                                     
    -- out --------------------------------------------------------------------------------------------------------------------
    -- out  Created table dbadmin.another_million_rows.
    -- out Copied table dbadmin.one_million_rows to dbadmin.another_million_rows
    

    I created a view combining the two:

    CREATE view vw_2_mill_rows AS SELECT * FROM one_million_rows UNION ALL SELECT * FROM another_million_rows;
    -- out CREATE VIEW
    -- out Time: First fetch (0 rows): 92.577 ms. All rows formatted: 92.670 ms
    

    Btw. I'm working on a one-noder, hence no hints as to which nodes the operators run on.

    Then,
    1. I explain the SELECT from one of the tables (I don't need to repeat on the other, as it is identical thanks to COPY_TABLE() ):

    EXPLAIN SELECT COUNT(1) FROM  one_million_rows; 
    -- out  Access Path:
    -- out  +-GROUPBY NOTHING [Cost: 3, Rows: 1] (PATH ID: 1)
    -- out  |  Aggregates: count(1)
    -- out  | +---> STORAGE ACCESS for one_million_rows [Cost: 2, Rows: 1M (1 RLE)] (PATH ID: 2)
    -- out  | |      Projection: dbadmin.one_million_rows_super
    
    1. I explain the select from the view:
    -- out  Access Path:
    -- out  +-GROUPBY NOTHING [Cost: 5K, Rows: 3] (PATH ID: 5)
    -- out  |  Aggregates: count(1)
    -- out  | +---> UNION ALL [Cost: 5K, Rows: 2M] (PATH ID: 7)
    -- out  | | +---> STORAGE ACCESS for one_million_rows [Cost: 2K, Rows: 1M] (PATH ID: 9)
    -- out  | | |      Projection: dbadmin.one_million_rows_super
    -- out  | | |      Materialize: one_million_rows.id
    -- out  | | +---> STORAGE ACCESS for another_million_rows [Cost: 2K, Rows: 1M] (PATH ID: 11)
    -- out  | | |      Projection: dbadmin.another_million_rows_super
    -- out  | | |      Materialize: another_million_rows.id
    

    Note that for the SELECT COUNT(1) on a table, nothing needs materialising, as the row count is usually stored in the table statistics and is picked from there, resulting in an overall cost efficient of 3.

    Note further that in case of a UNION ALL SELECT , it is too risky to pick the row count from the statistics, so the primary key column is materialised on both branches of the UNION ALL SELECT , each resulting in a cost coefficient of 2000; the UNION ALL itself adds another 1000 to the cost coefficient, resulting in 5000, and the top final GROUPBY NOTHING costs something negligible, not modifying the value of 5000.

    So, it's no wonder that I get the following run times:

    marco ~/1/Vertica/supp $ vsql -ic "select count(1) from one_million_rows"
      count  
    ---------
     1000000
    (1 row)
    
    Time: First fetch (1 row): 16.327 ms. All rows formatted: 16.455 ms
    marco ~/1/Vertica/supp $ vsql -ic "select count(1) from vw_2_mill_rows"  
      count  
    ---------
     2000000
    (1 row)
    
    Time: First fetch (1 row): 65.494 ms. All rows formatted: 65.661 ms
    
  • dimitri_pdimitri_p
    edited May 5

    well, in my case one of the tables was blank (not sure if it mattered but I thought I needed to mention it) and the difference was not like in your case - 16 vs 65ms, but it was like less than a second versus forever (didn't bother to wait it out, took more than a couple of mins), which is over a 100 times difference.

  • marcothesanemarcothesane - Select Field - Administrator

    Well, I truncated my another_million_rows and re-explained and re-ran. Same plan, and 2/3 of the original run time, which seems reasonable.

    I could imagine that, depending on projection layout, on existence of primary key or not, on not exactly matching, but only compatible data types in the columns - and with a greater number of columns and a higher row count - it could worsen dramatically.

    I suggest you run an explain and share the output here. Just the block titled Access Path: of the output - and we could find more hints as to the heaviness of the query.

    Remember, a UNION SELECT is a query on a set operation ("Union") of two materialised query result sets. At least Vertica is clever enough to just materialise the primary key columns instead of all columns of both tables. Any late materialisation shortcut that Vertica could take is made impossible by that.

This discussion has been closed.