Options

Slow lookups and data loading

I am using Pentaho Data Integration for loading my facts and dimensions. I have a lot of profile or combination dimensions and SCD Type 2 dimensions. I am bulk loading the fact table which is loading at the rate of 2000 rows per second to ROS. But loading the and looking up into the dimension is around 21 rows per second. I am using just one node but still I expect a faster loading performance.

Thoughts?

Comments

  • Options

    Does your dimension table have sequences/auto-increment?

  • Options
    edited June 2017

    I have made all the Surrogate keys as integers. ETL step increments this number by 1 to the table maximum.

  • Options

    "ETL steps increments the number by 1 to the table maximum." That's your problem right there.

  • Options

    So, what do you suggest?

  • Options

    Well, I can't help without knowing the whole process. But you know where the problem lies. That's a start.

  • Options
    marcothesanemarcothesane - Select Field - Administrator

    Don't do a row by row. As I already commented elsewhere: Almost all ETL tools are very inefficient with their automated SCD maintenance functionality. They want to check row by row whether an incrememtally incoming row has Type 1, Type 2 (maybe other types) SCD columns, and then immediately fire an UPDATE (Type 1) or an UPDATE of the valid_until column and/or the is_current column plus an INSERT with the new values (Type 2) . In the worst of cases, they want to maintain their own surrogate key, or even look up one row for the next surrogate key. This is poison not only for Vertica, but for all dimensional data warehouse platforms with a data volume to be taken seriously. The best workaround is to populate with a mass insert a staging table with identical structure as the SCD table with all affected rows, in the shape they should have at the end (as described above), and finally to run an optimized MERGE. –

  • Options

    Thanks @marcothesane ! I will try this. But my ETL tool was not this inefficient with postgreSQL. Any specific reason?

  • Options

    PostgreSQL is row store. Vertica is column store. Updates are much quicker on a row store. Here is the first link that pops up for me when I search for row store vs column store. http://teachmehana.com/row-store-vs-column-store/. Hope it helps.

  • Options
    marcothesanemarcothesane - Select Field - Administrator

    ... and I would not at all be surprised if the same technique turned out to be faster also in postgres SQL ...

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file