Slow lookups and data loading

nkarkare · June 2017

I am using Pentaho Data Integration for loading my facts and dimensions. I have a lot of profile or combination dimensions and SCD Type 2 dimensions. I am bulk loading the fact table which is loading at the rate of 2000 rows per second to ROS. But loading the and looking up into the dimension is around 21 rows per second. I am using just one node but still I expect a faster loading performance.

Thoughts?

aurorak · June 2017

Does your dimension table have sequences/auto-increment?

nkarkare · June 2017

I have made all the Surrogate keys as integers. ETL step increments this number by 1 to the table maximum.

aurorak · June 2017

"ETL steps increments the number by 1 to the table maximum." That's your problem right there.

nkarkare · June 2017

So, what do you suggest?

aurorak · June 2017

Well, I can't help without knowing the whole process. But you know where the problem lies. That's a start.

marcothesane · June 2017

Don't do a row by row. As I already commented elsewhere: Almost all ETL tools are very inefficient with their automated SCD maintenance functionality. They want to check row by row whether an incrememtally incoming row has Type 1, Type 2 (maybe other types) SCD columns, and then immediately fire an UPDATE (Type 1) or an UPDATE of the valid_until column and/or the is_current column plus an INSERT with the new values (Type 2) . In the worst of cases, they want to maintain their own surrogate key, or even look up one row for the next surrogate key. This is poison not only for Vertica, but for all dimensional data warehouse platforms with a data volume to be taken seriously. The best workaround is to populate with a mass insert a staging table with identical structure as the SCD table with all affected rows, in the shape they should have at the end (as described above), and finally to run an optimized MERGE. –

nkarkare · June 2017

Thanks @marcothesane ! I will try this. But my ETL tool was not this inefficient with postgreSQL. Any specific reason?

aurorak · June 2017

PostgreSQL is row store. Vertica is column store. Updates are much quicker on a row store. Here is the first link that pops up for me when I search for row store vs column store. http://teachmehana.com/row-store-vs-column-store/. Hope it helps.

marcothesane · June 2017

... and I would not at all be surprised if the same technique turned out to be faster also in postgres SQL ...

We're Moving!

Create My New Community Account Now

Slow lookups and data loading

Comments

Leave a Comment