Join Big tables (Pre-Proj, Flattened tables, ...)
Hello everyone
I have 2 table which are loaded continuously.
There is one column in both table as Foreign key ("ID")
Main table has about 20m records and 2nd one about 2m.
In some cases i need to join 2nd table with 1st table on FK.
The join query takes too long for just 1000 records (more than 1 minute)!!
(bath table has suitable projection based on sorted columns)
Few days ago i created pre-join projection (Vertica 9.0.1) and it was very fast for join,
but i got this error while insert new record:[Vertica][VJDBC](3815) ERROR: Join inner did not fit in memory
Then i found pre-join was deprecated, it i should use Flattened Table!!
It was fast for query and also insert! But ...
But as both tables are loaded continuously, i should 'refresh_columns' manually and it doesn't rebuild columns for new records automatically!!!
@Jim_Knicely
All comments are welcome,
Thanks
Comments
Hi,
Did you try the option to set a column DEFAULT to a query-expression in the flattened table?
Example:
In the example I didn't have to run REFRESH_COLUMNS to populate the C3 column in the FACT_FLAT table. I just had to make sure the data already existed in the remote (i.e. DIM table) first. Is that not an option in your case?
Yes it is, I thought i can't use DEFAULT like this in flattened table.
Thanks,
You can!
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/SetUsingVersusDefault.htm
I created a table with FK and DEFAULT based on what you said.
But insert into that table is terribly slow;
It takes 7 minutes for insert just 20 records!!
Hi,
Here is a quick test:
Fact table:
Dimension table:
I inserted 2 million records into the DIM table with random data for the C2 column to match your row count:
Now I'll try to insert a single record into the FACT table:
That was pretty quick!
What about 1,999,999 records?
Not too bad!
Did we actually populate the C2_DIM column in the FACT table?
Yup!
How about if the FACT table has 20 million records? Let's find out!
Now I'll insert 20 records into the FACT table:
That took less than 1 second!
Quick check to see if the flattened table look up worked:
Yup
Thanks for you reply,
This is my experience based on your scenario and my data :
I have two DIM tables (DIM2, DIM10), 1st one have 2M rows and 2nd one 10M rows.
and same DDL:
ALTER TABLE public.DIM2 ADD CONSTRAINT C_PRIMARY PRIMARY KEY (A1) DISABLED;
ALTER TABLE public.DIM2 ADD CONSTRAINT C_PRIMARY PRIMARY KEY (A1) DISABLED;
and my fact table is:
Insert result:
As you can see in 2nd image, select query from fact_temp is fast enough.
Also i said before that the main table (like DIM) has more than 200M rows and insertion time will takes more time!
So what is the main issue about different between my times and yours and how can fix it?
BR
Any comment?
@Jim_Knicely
Up!
Up! Any comment?
Hi,
I'm seeing a performance hit when using a Flattened Table, but nothing to the extent that you are!
I am testing on a 3 node cluster using Vertica 9.2.0-3.
Row count for DIM2:
Insert using column defaults for the Flattened Table:
Insert using column sub-SELECTs:
Insert using table join:
That last option is obviously the fastest!
Are you replicating the Dimension table?
Note that Flattened Tables take a big performance hit when the dimension table is large. There is an open Jira to "Improve Flattened table performance on large dim tables". I will keep this tread updated on a release date.
Thanks Jim,
I'll create a new test case same as yours (Projection, Encoding, ...) on 3 nodes and let you know about the results.