Slow lookups and data loading

I am using Pentaho Data Integration for loading my facts and dimensions. I have a lot of profile or combination dimensions and SCD Type 2 dimensions. I am bulk loading the fact table which is loading at the rate of 2000 rows per second to ROS. But loading the and looking up into the dimension is around 21 rows per second. I am using just one node but still I expect a faster loading performance.

Thoughts?

Comments

  • Does your dimension table have sequences/auto-increment?

  • edited June 2017

    I have made all the Surrogate keys as integers. ETL step increments this number by 1 to the table maximum.

  • "ETL steps increments the number by 1 to the table maximum." That's your problem right there.

  • So, what do you suggest?

  • Well, I can't help without knowing the whole process. But you know where the problem lies. That's a start.

  • marcothesanemarcothesane - Select Field - Administrator

    Don't do a row by row. As I already commented elsewhere: Almost all ETL tools are very inefficient with their automated SCD maintenance functionality. They want to check row by row whether an incrememtally incoming row has Type 1, Type 2 (maybe other types) SCD columns, and then immediately fire an UPDATE (Type 1) or an UPDATE of the valid_until column and/or the is_current column plus an INSERT with the new values (Type 2) . In the worst of cases, they want to maintain their own surrogate key, or even look up one row for the next surrogate key. This is poison not only for Vertica, but for all dimensional data warehouse platforms with a data volume to be taken seriously. The best workaround is to populate with a mass insert a staging table with identical structure as the SCD table with all affected rows, in the shape they should have at the end (as described above), and finally to run an optimized MERGE. –

  • Thanks @marcothesane ! I will try this. But my ETL tool was not this inefficient with postgreSQL. Any specific reason?

  • PostgreSQL is row store. Vertica is column store. Updates are much quicker on a row store. Here is the first link that pops up for me when I search for row store vs column store. http://teachmehana.com/row-store-vs-column-store/. Hope it helps.

  • marcothesanemarcothesane - Select Field - Administrator

    ... and I would not at all be surprised if the same technique turned out to be faster also in postgres SQL ...

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file