Join Big tables (Pre-Proj, Flattened tables, ...)
Hello everyone
I have 2 table which are loaded continuously.
There is one column in both table as Foreign key ("ID")
Main table has about 20m records and 2nd one about 2m.
In some cases i need to join 2nd table with 1st table on FK.
The join query takes too long for just 1000 records (more than 1 minute)!!
(bath table has suitable projection based on sorted columns)
Few days ago i created pre-join projection (Vertica 9.0.1) and it was very fast for join,
but i got this error while insert new record:[Vertica][VJDBC](3815) ERROR: Join inner did not fit in memory
Then i found pre-join was deprecated, it i should use Flattened Table!!
It was fast for query and also insert! But ...
But as both tables are loaded continuously, i should 'refresh_columns' manually and it doesn't rebuild columns for new records automatically!!!
@Jim_Knicely
All comments are welcome,
Thanks
Comments
Hi,
Did you try the option to set a column DEFAULT to a query-expression in the flattened table?
Example:
dbadmin=> CREATE TABLE dim (c2 INT PRIMARY KEY ENABLED, c3 VARCHAR(10)); CREATE TABLE dbadmin=> CREATE TABLE fact_flat (c1 INT PRIMARY KEY ENABLED, c2 INT NOT NULL REFERENCES dim(c2), c3 VARCHAR(10) DEFAULT (SELECT c3 FROM dim WHERE c2 = fact_flat.c2)); CREATE TABLE dbadmin=> INSERT INTO dim SELECT 1, 'TEST1'; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO dim SELECT 2, 'TEST2'; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO dim SELECT 3, 'TEST3'; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO fact_flat (c1, c2) SELECT 1, 1; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO fact_flat (c1, c2) SELECT 2, 3; OUTPUT -------- 1 (1 row) dbadmin=> SELECT * FROM fact_flat; c1 | c2 | c3 ----+----+------- 1 | 1 | TEST1 2 | 3 | TEST3 (2 rows)In the example I didn't have to run REFRESH_COLUMNS to populate the C3 column in the FACT_FLAT table. I just had to make sure the data already existed in the remote (i.e. DIM table) first. Is that not an option in your case?
Yes it is, I thought i can't use DEFAULT like this in flattened table.
Thanks,
You can!
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/SetUsingVersusDefault.htm
I created a table with FK and DEFAULT based on what you said.


But insert into that table is terribly slow;
It takes 7 minutes for insert just 20 records!!
Hi,
Here is a quick test:
Fact table:
Dimension table:
I inserted 2 million records into the DIM table with random data for the C2 column to match your row count:
Now I'll try to insert a single record into the FACT table:
dbadmin->vmart@sandbox1=>* \timing on Timing is on. dbadmin->vmart@sandbox1=>* INSERT /*+ DIRECT */ INTO fact SELECT 1, 1342309; OUTPUT -------- 1 (1 row) Time: First fetch (1 row): 296.115 ms. All rows formatted: 296.145 ms dbadmin->vmart@sandbox1=>* SELECT * FROM fact WHERE c1 = 1; c1 | c1_dim | c2_dim ----+---------+--------- 1 | 1342309 | QOAILPF (1 row) Time: First fetch (1 row): 166.055 ms. All rows formatted: 166.089 msThat was pretty quick!
What about 1,999,999 records?
Not too bad!
Did we actually populate the C2_DIM column in the FACT table?
Yup!
How about if the FACT table has 20 million records? Let's find out!
Now I'll insert 20 records into the FACT table:
dbadmin->vmart@sandbox1=> INSERT /*+ DIRECT */ INTO fact SELECT ROW_NUMBER() OVER () + 20000000, 1 FROM dim LIMIT 20; OUTPUT -------- 20 (1 row) Time: First fetch (1 row): 568.442 ms. All rows formatted: 568.473 msThat took less than 1 second!
Quick check to see if the flattened table look up worked:
Yup
Thanks for you reply,
:
This is my experience based on your scenario and my data
I have two DIM tables (DIM2, DIM10), 1st one have 2M rows and 2nd one 10M rows.
and same DDL:
ALTER TABLE public.DIM2 ADD CONSTRAINT C_PRIMARY PRIMARY KEY (A1) DISABLED;
ALTER TABLE public.DIM2 ADD CONSTRAINT C_PRIMARY PRIMARY KEY (A1) DISABLED;
and my fact table is:
Insert result:
As you can see in 2nd image, select query from fact_temp is fast enough.
Also i said before that the main table (like DIM) has more than 200M rows and insertion time will takes more time!
So what is the main issue about different between my times and yours and how can fix it?
BR
Any comment?
@Jim_Knicely
Up!
Up! Any comment?
Hi,
I'm seeing a performance hit when using a Flattened Table, but nothing to the extent that you are!
I am testing on a 3 node cluster using Vertica 9.2.0-3.
CREATE TABLE public.fact2 ( B1 IDENTITY , B2 int, B3 int, B4 long varchar(100000), B5 int NOT NULL, B6 int, B7 int, B8 int, B9 int, B10 int, B11 int, B12 timestamp, B13 timestamp, B14 timestamp, B15 int DEFAULT ( SELECT DIM2.A5 FROM public.DIM2 WHERE (DIM2.A1 = fact2.B5)), B16 int DEFAULT ( SELECT DIM2.A8 FROM public.DIM2 WHERE (DIM2.A1 = fact2.B5)), B17 varchar(5) DEFAULT ( SELECT DIM2.A7 FROM public.DIM2 WHERE (DIM2.A1 = fact2.B5)), B18 varchar(5) DEFAULT ( SELECT DIM2.A10 FROM public.DIM2 WHERE (DIM2.A1 = fact2.B5)) ); ALTER TABLE public.fact2 ADD CONSTRAINT fact2_fk1 FOREIGN KEY (B5) references public.DIM2 (A1); CREATE PROJECTION public.fact2 /*+createtype(D)*/ ( B1, B2 ENCODING COMMONDELTA_COMP, B3 ENCODING COMMONDELTA_COMP, B4, B5 ENCODING COMMONDELTA_COMP, B6, B7, B8, B9, B10, B11, B12, B13 ENCODING BLOCKDICT_COMP, B14, B15, B16, B17, B18 ) AS SELECT fact2.B1, fact2.B2, fact2.B3, fact2.B4, fact2.B5, fact2.B6, fact2.B7, fact2.B8, fact2.B9, fact2.B10, fact2.B11, fact2.B12, fact2.B13, fact2.B14, fact2.B15, fact2.B16, fact2.B17, fact2.B18 FROM public.fact2 ORDER BY fact2.B5, fact2.B1, fact2.B2, fact2.B3, fact2.B6, fact2.B7, fact2.B8, fact2.B9 SEGMENTED BY hash(fact2.B1, fact2.B2, fact2.B3, fact2.B5, fact2.B6, fact2.B7, fact2.B8, fact2.B9) ALL NODES KSAFE 1; CREATE TABLE public.DIM2 ( A1 int NOT NULL, A2 timestamp NOT NULL DEFAULT (now())::timestamptz(6), A3 timestamp DEFAULT NULL::timestamp, A4 int NOT NULL, A5 int DEFAULT NULL::int, A6 int DEFAULT NULL::int, A7 varchar(5) DEFAULT NULL, A8 int DEFAULT NULL::int, A9 int DEFAULT NULL::int, A10 varchar(5) DEFAULT NULL, A11 int DEFAULT NULL::int, A12 varbinary(255) DEFAULT NULL::varbinary, A13 int DEFAULT NULL::int, A14 int NOT NULL, A15 int NOT NULL, A16 int NOT NULL, A17 int NOT NULL ); ALTER TABLE public.DIM2 ADD CONSTRAINT dim2_pk PRIMARY KEY (A1) ENABLED; CREATE PROJECTION public.DIM2_DBD_1_rep_jim /*+createtype(D)*/ ( A1 ENCODING COMMONDELTA_COMP, A2 ENCODING RLE, A3 ENCODING RLE, A4 ENCODING DELTAVAL, A5 ENCODING DELTAVAL, A6 ENCODING DELTAVAL, A7, A8 ENCODING DELTAVAL, A9 ENCODING DELTAVAL, A10, A11 ENCODING DELTAVAL, A12, A13 ENCODING DELTAVAL, A14 ENCODING DELTAVAL, A15 ENCODING DELTAVAL, A16 ENCODING DELTAVAL, A17 ENCODING DELTAVAL ) AS SELECT DIM2.A1, DIM2.A2, DIM2.A3, DIM2.A4, DIM2.A5, DIM2.A6, DIM2.A7, DIM2.A8, DIM2.A9, DIM2.A10, DIM2.A11, DIM2.A12, DIM2.A13, DIM2.A14, DIM2.A15, DIM2.A16, DIM2.A17 FROM public.DIM2 ORDER BY DIM2.A1 UNSEGMENTED ALL NODES;Row count for DIM2:
Insert using column defaults for the Flattened Table:
dbadmin->vmart@sandbox1=> insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14 FROM fact; OUTPUT -------- 23 (1 row) Time: First fetch (1 row): 3784.094 ms. All rows formatted: 3784.149 msInsert using column sub-SELECTs:
dbadmin->vmart@sandbox1=>* insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, (SELECT DIM2.A5 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A8 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A7 FROM dim2 where dim2.a1 = fact.b5), (SELECT DIM2.A10 FROM dim2 where dim2.a1 = fact.b5) from fact; OUTPUT -------- 23 (1 row) Time: First fetch (1 row): 3511.827 ms. All rows formatted: 3511.867 msInsert using table join:
dbadmin->vmart@sandbox1=> insert into fact2 (B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18) SELECT B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, DIM2.A5, DIM2.A8, DIM2.A7, DIM2.A10 from fact join dim2 on dim2.a1 = fact.b5; OUTPUT -------- 23 (1 row) Time: First fetch (1 row): 59.493 ms. All rows formatted: 59.525 msThat last option is obviously the fastest!
Are you replicating the Dimension table?
Note that Flattened Tables take a big performance hit when the dimension table is large. There is an open Jira to "Improve Flattened table performance on large dim tables". I will keep this tread updated on a release date.
Thanks Jim,
I'll create a new test case same as yours (Projection, Encoding, ...) on 3 nodes and let you know about the results.