Join Big tables (Pre-Proj, Flattened tables, ...)

verban · February 2019

Hello everyone

I have 2 table which are loaded continuously.
There is one column in both table as Foreign key ("ID")
Main table has about 20m records and 2nd one about 2m.
In some cases i need to join 2nd table with 1st table on FK.
The join query takes too long for just 1000 records (more than 1 minute)!!
(bath table has suitable projection based on sorted columns)
Few days ago i created pre-join projection (Vertica 9.0.1) and it was very fast for join,
but i got this error while insert new record:
[Vertica][VJDBC](3815) ERROR: Join inner did not fit in memory
Then i found pre-join was deprecated, it i should use Flattened Table!!
It was fast for query and also insert! But ...
But as both tables are loaded continuously, i should 'refresh_columns' manually and it doesn't rebuild columns for new records automatically!!!

@Jim_Knicely
All comments are welcome,
Thanks

Jim_Knicely · February 2019

Hi,

Did you try the option to set a column DEFAULT to a query-expression in the flattened table?

Example:

dbadmin=> CREATE TABLE dim (c2 INT PRIMARY KEY ENABLED, c3 VARCHAR(10));
CREATE TABLE

dbadmin=> CREATE TABLE fact_flat (c1 INT PRIMARY KEY ENABLED, c2 INT NOT NULL REFERENCES dim(c2), c3 VARCHAR(10) DEFAULT (SELECT c3 FROM dim WHERE c2 = fact_flat.c2));
CREATE TABLE

dbadmin=> INSERT INTO dim SELECT 1, 'TEST1';
 OUTPUT
--------
      1
(1 row)

dbadmin=> INSERT INTO dim SELECT 2, 'TEST2';
 OUTPUT
--------
      1
(1 row)

dbadmin=> INSERT INTO dim SELECT 3, 'TEST3';
 OUTPUT
--------
      1
(1 row)

dbadmin=> INSERT INTO fact_flat (c1, c2) SELECT 1, 1;
 OUTPUT
--------
      1
(1 row)

dbadmin=> INSERT INTO fact_flat (c1, c2) SELECT 2, 3;
 OUTPUT
--------
      1
(1 row)

dbadmin=> SELECT * FROM fact_flat;
 c1 | c2 |  c3
----+----+-------
  1 |  1 | TEST1
  2 |  3 | TEST3
(2 rows)

In the example I didn't have to run REFRESH_COLUMNS to populate the C3 column in the FACT_FLAT table. I just had to make sure the data already existed in the remote (i.e. DIM table) first. Is that not an option in your case?

verban · February 2019

Yes it is, I thought i can't use DEFAULT like this in flattened table.
Thanks,

Jim_Knicely · February 2019

You can!

See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/SetUsingVersusDefault.htm

verban · February 2019

I created a table with FK and DEFAULT based on what you said.
But insert into that table is terribly slow;
It takes 7 minutes for insert just 20 records!!

Jim_Knicely · February 2019

Hi,

Here is a quick test:

Fact table:

dbadmin->vmart@sandbox1=> CREATE TABLE fact (c1 INT PRIMARY KEY ENABLED, c1_dim INT NOT NULL REFERENCES dim(c1), c2_dim VARCHAR(100) NOT NULL DEFAULT(SELECT c2 FROM dim WHERE dim.c1 = fact.c1_dim)) ORDER BY c1, c2_dim, c1_dim SEGMENTED BY HASH(c1) ALL NODES;
CREATE TABLE

Dimension table:

dbadmin->vmart@sandbox1=> CREATE TABLE dim (c1 INT PRIMARY KEY ENABLED, c2 VARCHAR(100)) ORDER BY c1 UNSEGMENTED ALL NODES;
CREATE TABLE

I inserted 2 million records into the DIM table with random data for the C2 column to match your row count:

dbadmin->vmart@sandbox1=> SELECT COUNT(*) FROM dim;
  COUNT
---------
 2000000
(1 row)

dbadmin->vmart@sandbox1=>* SELECT * FROM dim WHERE c1 = 1342309;
   c1    |   c2
---------+---------
 1342309 | QOAILPF
(1 row)

Now I'll try to insert a single record into the FACT table:

dbadmin->vmart@sandbox1=>* \timing on
Timing is on.

dbadmin->vmart@sandbox1=>* INSERT /*+ DIRECT */ INTO fact SELECT 1, 1342309;
 OUTPUT
--------
      1
(1 row)

Time: First fetch (1 row): 296.115 ms. All rows formatted: 296.145 ms

dbadmin->vmart@sandbox1=>* SELECT * FROM fact WHERE c1 = 1;
 c1 | c1_dim  | c2_dim
----+---------+---------
  1 | 1342309 | QOAILPF
(1 row)

Time: First fetch (1 row): 166.055 ms. All rows formatted: 166.089 ms

That was pretty quick!

What about 1,999,999 records?

dbadmin->vmart@sandbox1=> INSERT /*+ DIRECT */ INTO fact SELECT c1, c1 FROM dim WHERE c1 >=2;
 OUTPUT
---------
 1999999
(1 row)

Time: First fetch (1 row): 2142.384 ms. All rows formatted: 2142.417 ms

Not too bad!

Did we actually populate the C2_DIM column in the FACT table?

dbadmin->vmart@sandbox1=>* SELECT * FROM fact WHERE c1_dim BETWEEN 123721 AND 123723;
   c1   | c1_dim | c2_dim
--------+--------+--------
 123721 | 123721 | EATG
 123722 | 123722 | BYWGRQ
 123723 | 123723 | A
(3 rows)

Time: First fetch (3 rows): 65.312 ms. All rows formatted: 65.375 ms

dbadmin->vmart@sandbox1=>* SELECT * FROM dim WHERE c1 BETWEEN 123721 AND 123723;
   c1   |   c2
--------+--------
 123721 | EATG
 123722 | BYWGRQ
 123723 | A
(3 rows)

Time: First fetch (3 rows): 67.028 ms. All rows formatted: 67.065 ms

Yup!

How about if the FACT table has 20 million records? Let's find out!

dbadmin->vmart@sandbox1=> SELECT MIN(c1), MAX(c1), COUNT(*) FROM fact;
 MIN |   MAX    |  COUNT
-----+----------+----------
   1 | 20000000 | 20000000
(1 row)

Time: First fetch (1 row): 62.909 ms. All rows formatted: 62.944 ms

Now I'll insert 20 records into the FACT table:

dbadmin->vmart@sandbox1=> INSERT /*+ DIRECT */ INTO fact SELECT ROW_NUMBER() OVER () + 20000000, 1 FROM dim LIMIT 20;
 OUTPUT
--------
     20
(1 row)

Time: First fetch (1 row): 568.442 ms. All rows formatted: 568.473 ms

That took less than 1 second!

Quick check to see if the flattened table look up worked:

dbadmin->vmart@sandbox1=>* SELECT MIN(c2_dim) FROM fact WHERE c1 > 20000000;
  MIN
-------
 ETIIW
(1 row)

Time: First fetch (1 row): 20.065 ms. All rows formatted: 20.096 ms
dbadmin->vmart@sandbox1=>* SELECT * FROM dim WHERE c1 = 1;
 c1 |  c2
----+-------
  1 | ETIIW
(1 row)

Time: First fetch (1 row): 8.139 ms. All rows formatted: 8.168 ms

Yup

verban · February 2019

Thanks for you reply,
This is my experience based on your scenario and my data :

I have two DIM tables (DIM2, DIM10), 1st one have 2M rows and 2nd one 10M rows.
and same DDL:

CREATE TABLE public.DIM2
(
    A1 int NOT NULL,
    A2 timestamp NOT NULL DEFAULT (now())::timestamptz(6),
    A3 timestamp DEFAULT NULL::timestamp,
    A4 int NOT NULL,
    A5 int DEFAULT NULL::int,
    A6 int DEFAULT NULL::int,
    A7 varchar(5) DEFAULT NULL,
    A8 int DEFAULT NULL::int,
    A9 int DEFAULT NULL::int,
    A10 varchar(5) DEFAULT NULL,
    A11 int DEFAULT NULL::int,
    A12 varbinary(255) DEFAULT NULL::varbinary,
    A13 int DEFAULT NULL::int,
    A14 int NOT NULL,
    A15 int NOT NULL,
    A16 int NOT NULL,
    A17 int NOT NULL
);

ALTER TABLE public.DIM2 ADD CONSTRAINT C_PRIMARY PRIMARY KEY (A1) DISABLED;

DIM Tables content

and my fact table is:

CREATE TABLE public.fact2
(
    B1  IDENTITY ,
    B2 int,
    B3 int,
    B4 long varchar(100000),
    B5 int NOT NULL,
    B6 int,
    B7 int,
    B8 int,
    B10 int,
    B11 int,
    B12 timestamp,
    B13 timestamp,
    B14 timestamp,
    B15 int DEFAULT ( SELECT DIM2.A5
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5)),
    B16 int DEFAULT ( SELECT DIM2.A8
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5)),
    B17 varchar(5) DEFAULT ( SELECT DIM2.A7
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5)),
    B18 varchar(5) DEFAULT ( SELECT DIM2.A10
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5))
);


ALTER TABLE public.fact2 ADD CONSTRAINT C_FOREIGN FOREIGN KEY (B3) references public.DIM2 (A1);
ALTER TABLE public.fact2 ADD CONSTRAINT C_FOREIGN_1 FOREIGN KEY (B5) references public.DIM2 (A1);

Insert result:

fact2 with reference to DIM2 (2M Rec)

fact10 with reference to DIM2 (10M Rec)

As you can see in 2nd image, select query from fact_temp is fast enough.
Also i said before that the main table (like DIM) has more than 200M rows and insertion time will takes more time!
So what is the main issue about different between my times and yours and how can fix it?

BR

verban · March 2019

Any comment?
@Jim_Knicely

verban · March 2019

Up!

verban · March 2019

Up! Any comment?

Jim_Knicely · March 2019

Hi,

I'm seeing a performance hit when using a Flattened Table, but nothing to the extent that you are!

I am testing on a 3 node cluster using Vertica 9.2.0-3.

CREATE TABLE public.fact2
(
    B1  IDENTITY ,
    B2 int,
    B3 int,
    B4 long varchar(100000),
    B5 int NOT NULL,
    B6 int,
    B7 int,
    B8 int,
    B9 int,
    B10 int,
    B11 int,
    B12 timestamp,
    B13 timestamp,
    B14 timestamp,
    B15 int DEFAULT ( SELECT DIM2.A5
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5)),
    B16 int DEFAULT ( SELECT DIM2.A8
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5)),
    B17 varchar(5) DEFAULT ( SELECT DIM2.A7
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5)),
    B18 varchar(5) DEFAULT ( SELECT DIM2.A10
 FROM public.DIM2
 WHERE (DIM2.A1 = fact2.B5))
);

ALTER TABLE public.fact2 ADD CONSTRAINT fact2_fk1 FOREIGN KEY (B5) references public.DIM2 (A1);

CREATE PROJECTION public.fact2 /*+createtype(D)*/
(
 B1,
 B2 ENCODING COMMONDELTA_COMP,
 B3 ENCODING COMMONDELTA_COMP,
 B4,
 B5 ENCODING COMMONDELTA_COMP,
 B6,
 B7,
 B8,
 B9,
 B10,
 B11,
 B12,
 B13 ENCODING BLOCKDICT_COMP,
 B14,
 B15,
 B16,
 B17,
 B18
)
AS
 SELECT fact2.B1,
        fact2.B2,
        fact2.B3,
        fact2.B4,
        fact2.B5,
        fact2.B6,
        fact2.B7,
        fact2.B8,
        fact2.B9,
        fact2.B10,
        fact2.B11,
        fact2.B12,
        fact2.B13,
        fact2.B14,
        fact2.B15,
        fact2.B16,
        fact2.B17,
        fact2.B18
 FROM public.fact2
 ORDER BY fact2.B5,
          fact2.B1,
          fact2.B2,
          fact2.B3,
          fact2.B6,
          fact2.B7,
          fact2.B8,
          fact2.B9
SEGMENTED BY hash(fact2.B1, fact2.B2, fact2.B3, fact2.B5, fact2.B6, fact2.B7, fact2.B8, fact2.B9) ALL NODES KSAFE 1;

CREATE TABLE public.DIM2
(
    A1 int NOT NULL,
    A2 timestamp NOT NULL DEFAULT (now())::timestamptz(6),
    A3 timestamp DEFAULT NULL::timestamp,
    A4 int NOT NULL,
    A5 int DEFAULT NULL::int,
    A6 int DEFAULT NULL::int,
    A7 varchar(5) DEFAULT NULL,
    A8 int DEFAULT NULL::int,
    A9 int DEFAULT NULL::int,
    A10 varchar(5) DEFAULT NULL,
    A11 int DEFAULT NULL::int,
    A12 varbinary(255) DEFAULT NULL::varbinary,
    A13 int DEFAULT NULL::int,
    A14 int NOT NULL,
    A15 int NOT NULL,
    A16 int NOT NULL,
    A17 int NOT NULL
);

ALTER TABLE public.DIM2 ADD CONSTRAINT dim2_pk PRIMARY KEY (A1) ENABLED;

CREATE PROJECTION public.DIM2_DBD_1_rep_jim /*+createtype(D)*/
(
 A1 ENCODING COMMONDELTA_COMP,
 A2 ENCODING RLE,
 A3 ENCODING RLE,
 A4 ENCODING DELTAVAL,
 A5 ENCODING DELTAVAL,
 A6 ENCODING DELTAVAL,
 A7,
 A8 ENCODING DELTAVAL,
 A9 ENCODING DELTAVAL,
 A10,
 A11 ENCODING DELTAVAL,
 A12,
 A13 ENCODING DELTAVAL,
 A14 ENCODING DELTAVAL,
 A15 ENCODING DELTAVAL,
 A16 ENCODING DELTAVAL,
 A17 ENCODING DELTAVAL
)
AS
 SELECT DIM2.A1,
        DIM2.A2,
        DIM2.A3,
        DIM2.A4,
        DIM2.A5,
        DIM2.A6,
        DIM2.A7,
        DIM2.A8,
        DIM2.A9,
        DIM2.A10,
        DIM2.A11,
        DIM2.A12,
        DIM2.A13,
        DIM2.A14,
        DIM2.A15,
        DIM2.A16,
        DIM2.A17
 FROM public.DIM2
 ORDER BY DIM2.A1
UNSEGMENTED ALL NODES;

Row count for DIM2:

dbadmin->vmart@sandbox1=> SELECT COUNT(*) FROM dim2;
  COUNT
---------
 2000000
(1 row)

Insert using column defaults for the Flattened Table:

dbadmin->vmart@sandbox1=> insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14 FROM fact;
 OUTPUT
--------
     23
(1 row)

Time: First fetch (1 row): 3784.094 ms. All rows formatted: 3784.149 ms

Insert using column sub-SELECTs:

dbadmin->vmart@sandbox1=>* insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, (SELECT DIM2.A5 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A8 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A7 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A10 FROM dim2 where dim2.a1 = fact.b5) from fact;
 OUTPUT
--------
     23
(1 row)

Time: First fetch (1 row): 3511.827 ms. All rows formatted: 3511.867 ms

Insert using table join:

dbadmin->vmart@sandbox1=> insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, DIM2.A5, DIM2.A8, DIM2.A7, DIM2.A10 from fact join dim2 on dim2.a1 = fact.b5;
 OUTPUT
--------
     23
(1 row)

Time: First fetch (1 row): 59.493 ms. All rows formatted: 59.525 ms

That last option is obviously the fastest!

Are you replicating the Dimension table?

Note that Flattened Tables take a big performance hit when the dimension table is large. There is an open Jira to "Improve Flattened table performance on large dim tables". I will keep this tread updated on a release date.

verban · March 2019

Thanks Jim,
I'll create a new test case same as yours (Projection, Encoding, ...) on 3 nodes and let you know about the results.

Join Big tables (Pre-Proj, Flattened tables, ...)

Comments

Leave a Comment