Join Big tables (Pre-Proj, Flattened tables, ...)

verbanverban Registered User
edited February 18 in Vertica Forum

Hello everyone

I have 2 table which are loaded continuously.
There is one column in both table as Foreign key ("ID")
Main table has about 20m records and 2nd one about 2m.
In some cases i need to join 2nd table with 1st table on FK.
The join query takes too long for just 1000 records (more than 1 minute)!!
(bath table has suitable projection based on sorted columns)
Few days ago i created pre-join projection (Vertica 9.0.1) and it was very fast for join,
but i got this error while insert new record:
[Vertica][VJDBC](3815) ERROR: Join inner did not fit in memory
Then i found pre-join was deprecated, it i should use Flattened Table!!
It was fast for query and also insert! But ...
But as both tables are loaded continuously, i should 'refresh_columns' manually and it doesn't rebuild columns for new records automatically!!!

@Jim_Knicely
All comments are welcome,
Thanks

Comments

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited February 20

    Hi,

    Did you try the option to set a column DEFAULT to a query-expression in the flattened table?

    Example:

    dbadmin=> CREATE TABLE dim (c2 INT PRIMARY KEY ENABLED, c3 VARCHAR(10));
    CREATE TABLE
    
    dbadmin=> CREATE TABLE fact_flat (c1 INT PRIMARY KEY ENABLED, c2 INT NOT NULL REFERENCES dim(c2), c3 VARCHAR(10) DEFAULT (SELECT c3 FROM dim WHERE c2 = fact_flat.c2));
    CREATE TABLE
    
    dbadmin=> INSERT INTO dim SELECT 1, 'TEST1';
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> INSERT INTO dim SELECT 2, 'TEST2';
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> INSERT INTO dim SELECT 3, 'TEST3';
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> INSERT INTO fact_flat (c1, c2) SELECT 1, 1;
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> INSERT INTO fact_flat (c1, c2) SELECT 2, 3;
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> SELECT * FROM fact_flat;
     c1 | c2 |  c3
    ----+----+-------
      1 |  1 | TEST1
      2 |  3 | TEST3
    (2 rows)
    

    In the example I didn't have to run REFRESH_COLUMNS to populate the C3 column in the FACT_FLAT table. I just had to make sure the data already existed in the remote (i.e. DIM table) first. Is that not an option in your case?

  • verbanverban Registered User

    Yes it is, I thought i can't use DEFAULT like this in flattened table.
    Thanks,

  • verbanverban Registered User

    I created a table with FK and DEFAULT based on what you said.
    But insert into that table is terribly slow; :|
    It takes 7 minutes for insert just 20 records!! :o :o

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited February 26

    Hi,

    Here is a quick test:

    Fact table:

    dbadmin->[email protected]=> CREATE TABLE fact (c1 INT PRIMARY KEY ENABLED, c1_dim INT NOT NULL REFERENCES dim(c1), c2_dim VARCHAR(100) NOT NULL DEFAULT(SELECT c2 FROM dim WHERE dim.c1 = fact.c1_dim)) ORDER BY c1, c2_dim, c1_dim SEGMENTED BY HASH(c1) ALL NODES;
    CREATE TABLE
    

    Dimension table:

    dbadmin->[email protected]=> CREATE TABLE dim (c1 INT PRIMARY KEY ENABLED, c2 VARCHAR(100)) ORDER BY c1 UNSEGMENTED ALL NODES;
    CREATE TABLE
    

    I inserted 2 million records into the DIM table with random data for the C2 column to match your row count:

    dbadmin->[email protected]=> SELECT COUNT(*) FROM dim;
      COUNT
    ---------
     2000000
    (1 row)
    
    dbadmin->[email protected]=>* SELECT * FROM dim WHERE c1 = 1342309;
       c1    |   c2
    ---------+---------
     1342309 | QOAILPF
    (1 row)
    

    Now I'll try to insert a single record into the FACT table:

    dbadmin->[email protected]=>* \timing on
    Timing is on.
    
    dbadmin->[email protected]=>* INSERT /*+ DIRECT */ INTO fact SELECT 1, 1342309;
     OUTPUT
    --------
          1
    (1 row)
    
    Time: First fetch (1 row): 296.115 ms. All rows formatted: 296.145 ms
    
    dbadmin->[email protected]=>* SELECT * FROM fact WHERE c1 = 1;
     c1 | c1_dim  | c2_dim
    ----+---------+---------
      1 | 1342309 | QOAILPF
    (1 row)
    
    Time: First fetch (1 row): 166.055 ms. All rows formatted: 166.089 ms
    

    That was pretty quick!

    What about 1,999,999 records?

    dbadmin->[email protected]=> INSERT /*+ DIRECT */ INTO fact SELECT c1, c1 FROM dim WHERE c1 >=2;
     OUTPUT
    ---------
     1999999
    (1 row)
    
    Time: First fetch (1 row): 2142.384 ms. All rows formatted: 2142.417 ms
    

    Not too bad!

    Did we actually populate the C2_DIM column in the FACT table?

    dbadmin->[email protected]=>* SELECT * FROM fact WHERE c1_dim BETWEEN 123721 AND 123723;
       c1   | c1_dim | c2_dim
    --------+--------+--------
     123721 | 123721 | EATG
     123722 | 123722 | BYWGRQ
     123723 | 123723 | A
    (3 rows)
    
    Time: First fetch (3 rows): 65.312 ms. All rows formatted: 65.375 ms
    
    dbadmin->[email protected]=>* SELECT * FROM dim WHERE c1 BETWEEN 123721 AND 123723;
       c1   |   c2
    --------+--------
     123721 | EATG
     123722 | BYWGRQ
     123723 | A
    (3 rows)
    
    Time: First fetch (3 rows): 67.028 ms. All rows formatted: 67.065 ms
    

    Yup!

    How about if the FACT table has 20 million records? Let's find out!

    dbadmin->[email protected]=> SELECT MIN(c1), MAX(c1), COUNT(*) FROM fact;
     MIN |   MAX    |  COUNT
    -----+----------+----------
       1 | 20000000 | 20000000
    (1 row)
    
    Time: First fetch (1 row): 62.909 ms. All rows formatted: 62.944 ms
    

    Now I'll insert 20 records into the FACT table:

    dbadmin->[email protected]andbox1=> INSERT /*+ DIRECT */ INTO fact SELECT ROW_NUMBER() OVER () + 20000000, 1 FROM dim LIMIT 20;
     OUTPUT
    --------
         20
    (1 row)
    
    Time: First fetch (1 row): 568.442 ms. All rows formatted: 568.473 ms
    

    That took less than 1 second!

    Quick check to see if the flattened table look up worked:

    dbadmin->[email protected]=>* SELECT MIN(c2_dim) FROM fact WHERE c1 > 20000000;
      MIN
    -------
     ETIIW
    (1 row)
    
    Time: First fetch (1 row): 20.065 ms. All rows formatted: 20.096 ms
    dbadmin->[email protected]=>* SELECT * FROM dim WHERE c1 = 1;
     c1 |  c2
    ----+-------
      1 | ETIIW
    (1 row)
    
    Time: First fetch (1 row): 8.139 ms. All rows formatted: 8.168 ms
    

    Yup :)

  • verbanverban Registered User

    Thanks for you reply,
    This is my experience based on your scenario and my data ;) :

    I have two DIM tables (DIM2, DIM10), 1st one have 2M rows and 2nd one 10M rows.
    and same DDL:

    CREATE TABLE public.DIM2
    (
        A1 int NOT NULL,
        A2 timestamp NOT NULL DEFAULT (now())::timestamptz(6),
        A3 timestamp DEFAULT NULL::timestamp,
        A4 int NOT NULL,
        A5 int DEFAULT NULL::int,
        A6 int DEFAULT NULL::int,
        A7 varchar(5) DEFAULT NULL,
        A8 int DEFAULT NULL::int,
        A9 int DEFAULT NULL::int,
        A10 varchar(5) DEFAULT NULL,
        A11 int DEFAULT NULL::int,
        A12 varbinary(255) DEFAULT NULL::varbinary,
        A13 int DEFAULT NULL::int,
        A14 int NOT NULL,
        A15 int NOT NULL,
        A16 int NOT NULL,
        A17 int NOT NULL
    );
    

    ALTER TABLE public.DIM2 ADD CONSTRAINT C_PRIMARY PRIMARY KEY (A1) DISABLED;

    ALTER TABLE public.DIM2 ADD CONSTRAINT C_PRIMARY PRIMARY KEY (A1) DISABLED;

    DIM Tables content

    and my fact table is:

    CREATE TABLE public.fact2
    (
        B1  IDENTITY ,
        B2 int,
        B3 int,
        B4 long varchar(100000),
        B5 int NOT NULL,
        B6 int,
        B7 int,
        B8 int,
        B10 int,
        B11 int,
        B12 timestamp,
        B13 timestamp,
        B14 timestamp,
        B15 int DEFAULT ( SELECT DIM2.A5
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5)),
        B16 int DEFAULT ( SELECT DIM2.A8
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5)),
        B17 varchar(5) DEFAULT ( SELECT DIM2.A7
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5)),
        B18 varchar(5) DEFAULT ( SELECT DIM2.A10
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5))
    );
    
    
    ALTER TABLE public.fact2 ADD CONSTRAINT C_FOREIGN FOREIGN KEY (B3) references public.DIM2 (A1);
    ALTER TABLE public.fact2 ADD CONSTRAINT C_FOREIGN_1 FOREIGN KEY (B5) references public.DIM2 (A1);
    

    Insert result:

    fact2 with reference to DIM2 (2M Rec)

    fact10 with reference to DIM2 (10M Rec)

    As you can see in 2nd image, select query from fact_temp is fast enough.
    Also i said before that the main table (like DIM) has more than 200M rows and insertion time will takes more time!
    So what is the main issue about different between my times and yours and how can fix it?

    BR

  • verbanverban Registered User

    Any comment?
    @Jim_Knicely

  • verbanverban Registered User

    Up!

  • verbanverban Registered User

    Up! Any comment?

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Hi,

    I'm seeing a performance hit when using a Flattened Table, but nothing to the extent that you are!

    I am testing on a 3 node cluster using Vertica 9.2.0-3.

    CREATE TABLE public.fact2
    (
        B1  IDENTITY ,
        B2 int,
        B3 int,
        B4 long varchar(100000),
        B5 int NOT NULL,
        B6 int,
        B7 int,
        B8 int,
        B9 int,
        B10 int,
        B11 int,
        B12 timestamp,
        B13 timestamp,
        B14 timestamp,
        B15 int DEFAULT ( SELECT DIM2.A5
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5)),
        B16 int DEFAULT ( SELECT DIM2.A8
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5)),
        B17 varchar(5) DEFAULT ( SELECT DIM2.A7
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5)),
        B18 varchar(5) DEFAULT ( SELECT DIM2.A10
     FROM public.DIM2
     WHERE (DIM2.A1 = fact2.B5))
    );
    
    ALTER TABLE public.fact2 ADD CONSTRAINT fact2_fk1 FOREIGN KEY (B5) references public.DIM2 (A1);
    
    CREATE PROJECTION public.fact2 /*+createtype(D)*/
    (
     B1,
     B2 ENCODING COMMONDELTA_COMP,
     B3 ENCODING COMMONDELTA_COMP,
     B4,
     B5 ENCODING COMMONDELTA_COMP,
     B6,
     B7,
     B8,
     B9,
     B10,
     B11,
     B12,
     B13 ENCODING BLOCKDICT_COMP,
     B14,
     B15,
     B16,
     B17,
     B18
    )
    AS
     SELECT fact2.B1,
            fact2.B2,
            fact2.B3,
            fact2.B4,
            fact2.B5,
            fact2.B6,
            fact2.B7,
            fact2.B8,
            fact2.B9,
            fact2.B10,
            fact2.B11,
            fact2.B12,
            fact2.B13,
            fact2.B14,
            fact2.B15,
            fact2.B16,
            fact2.B17,
            fact2.B18
     FROM public.fact2
     ORDER BY fact2.B5,
              fact2.B1,
              fact2.B2,
              fact2.B3,
              fact2.B6,
              fact2.B7,
              fact2.B8,
              fact2.B9
    SEGMENTED BY hash(fact2.B1, fact2.B2, fact2.B3, fact2.B5, fact2.B6, fact2.B7, fact2.B8, fact2.B9) ALL NODES KSAFE 1;
    
    CREATE TABLE public.DIM2
    (
        A1 int NOT NULL,
        A2 timestamp NOT NULL DEFAULT (now())::timestamptz(6),
        A3 timestamp DEFAULT NULL::timestamp,
        A4 int NOT NULL,
        A5 int DEFAULT NULL::int,
        A6 int DEFAULT NULL::int,
        A7 varchar(5) DEFAULT NULL,
        A8 int DEFAULT NULL::int,
        A9 int DEFAULT NULL::int,
        A10 varchar(5) DEFAULT NULL,
        A11 int DEFAULT NULL::int,
        A12 varbinary(255) DEFAULT NULL::varbinary,
        A13 int DEFAULT NULL::int,
        A14 int NOT NULL,
        A15 int NOT NULL,
        A16 int NOT NULL,
        A17 int NOT NULL
    );
    
    ALTER TABLE public.DIM2 ADD CONSTRAINT dim2_pk PRIMARY KEY (A1) ENABLED;
    
    CREATE PROJECTION public.DIM2_DBD_1_rep_jim /*+createtype(D)*/
    (
     A1 ENCODING COMMONDELTA_COMP,
     A2 ENCODING RLE,
     A3 ENCODING RLE,
     A4 ENCODING DELTAVAL,
     A5 ENCODING DELTAVAL,
     A6 ENCODING DELTAVAL,
     A7,
     A8 ENCODING DELTAVAL,
     A9 ENCODING DELTAVAL,
     A10,
     A11 ENCODING DELTAVAL,
     A12,
     A13 ENCODING DELTAVAL,
     A14 ENCODING DELTAVAL,
     A15 ENCODING DELTAVAL,
     A16 ENCODING DELTAVAL,
     A17 ENCODING DELTAVAL
    )
    AS
     SELECT DIM2.A1,
            DIM2.A2,
            DIM2.A3,
            DIM2.A4,
            DIM2.A5,
            DIM2.A6,
            DIM2.A7,
            DIM2.A8,
            DIM2.A9,
            DIM2.A10,
            DIM2.A11,
            DIM2.A12,
            DIM2.A13,
            DIM2.A14,
            DIM2.A15,
            DIM2.A16,
            DIM2.A17
     FROM public.DIM2
     ORDER BY DIM2.A1
    UNSEGMENTED ALL NODES;
    

    Row count for DIM2:

    dbadmin->[email protected]=> SELECT COUNT(*) FROM dim2;
      COUNT
    ---------
     2000000
    (1 row)
    

    Insert using column defaults for the Flattened Table:

    dbadmin->[email protected]=> insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14 FROM fact;
     OUTPUT
    --------
         23
    (1 row)
    
    Time: First fetch (1 row): 3784.094 ms. All rows formatted: 3784.149 ms
    

    Insert using column sub-SELECTs:

    dbadmin->[email protected]=>* insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, (SELECT DIM2.A5 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A8 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A7 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A10 FROM dim2 where dim2.a1 = fact.b5) from fact;
     OUTPUT
    --------
         23
    (1 row)
    
    Time: First fetch (1 row): 3511.827 ms. All rows formatted: 3511.867 ms
    

    Insert using table join:

    dbadmin->[email protected]=> insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, DIM2.A5, DIM2.A8, DIM2.A7, DIM2.A10 from fact join dim2 on dim2.a1 = fact.b5;
     OUTPUT
    --------
         23
    (1 row)
    
    Time: First fetch (1 row): 59.493 ms. All rows formatted: 59.525 ms
    

    That last option is obviously the fastest!

    Are you replicating the Dimension table?

    Note that Flattened Tables take a big performance hit when the dimension table is large. There is an open Jira to "Improve Flattened table performance on large dim tables". I will keep this tread updated on a release date.

  • verbanverban Registered User

    Thanks Jim,
    I'll create a new test case same as yours (Projection, Encoding, ...) on 3 nodes and let you know about the results.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file