Data warehouse modelling on Vertica

markus_zhang Community Edition User

Hi experts,

I've heard from a lot of people that in Vertica it's good to have wide, fat tables that hold every piece of information. My question is: is that true? And does it mean we don't need to care about snowflake/star schemas any more?

The frustration comes from not being able to tell "good" flat tables from "bad", "lazy" ones. Our BI team tends to use really wide tables with 100 columns, and whenever we want to connect two things (say we have a new feature, and we want to connect purchases of that feature with the virtual commodities obtained), they just add a new connection id, and the table becomes one column wider.

Is this good practice? I highly doubt it, but I'm not familiar with the internals of Vertica (TBH you guys don't publish much material/books on this topic; just look at the number of books on, say, SQL Server, but I guess that's normal for big-data products), so I can't make a proper judgement.

What are your thoughts on this? Is DWH modelling still a thing in Vertica?


Answers

  • Eliminating joins has always been a performance trick in data warehouse systems. That's a Kimball/Inmon thing, not a Vertica thing. But I don't think there's a right or wrong answer here. The correct solution is the one that works for you. Flattening a table to reduce joins will make queries fast, but at the cost of complicated ETL processes. Vertica's flattened tables feature can simplify that a great deal by letting the database do that work for you, so you don't have to write the code to build it yourself. With columnar storage like Vertica's, there's no penalty for having a wide table the way there is in a row-oriented database. The query cost only includes the columns that are actually referenced in the SQL. Even a 1,000-column-wide table is OK, since you're unlikely to be selecting all of those columns. That's the power of columnar storage. That's not to say that every database should be columnar - row-oriented databases have their use cases, and are good at what they do. But for analytics and warehousing, you definitely want to go columnar.
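
    For illustration, here's a minimal sketch of the flattened tables feature - the dimension, fact, and column names below are made up, not from your schema:

    ```sql
    -- Hypothetical dimension table
    CREATE TABLE dim_customer (
        customer_id INT PRIMARY KEY,
        region      VARCHAR(50)
    );

    -- Fact table with one flattened (denormalized) dimension column;
    -- Vertica populates customer_region from dim_customer at load time,
    -- so queries can group/filter on region without a join
    CREATE TABLE fact_purchase (
        purchase_id     INT,
        customer_id     INT,
        amount          NUMERIC(12,2),
        customer_region VARCHAR(50)
            DEFAULT (SELECT region
                     FROM dim_customer
                     WHERE dim_customer.customer_id = fact_purchase.customer_id)
    );
    ```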

  • markus_zhang Community Edition User

    Thanks @Vertica_Curtis for the quick answer. So I guess it's reasonable to have really wide tables after all, at least from an engineering perspective.

    I think I'm asking the wrong question here; it's not wide tables that are the problem, it's the data modelling. Essentially, the wide tables I'm talking about are transaction tables, so for example when you make a purchase, the revenue and the virtual items end up in different rows, and I actually have to use a lot of self joins to grab what I want (one join to connect the purchase with the virtual items, another join to grab player levels, etc.). There's a rough sketch of that pattern at the end of this post.

    So maybe the following is a better summary?
    1 - Back in a traditional RDBMS we used Kimball to build the DWH with a star schema; now in Vertica we flatten the tables into one wide table
    2 - But that doesn't mean we should just use the raw transaction records; it's still a DWH, just in one table with only the necessary information (for example, in Vertica we could have one wide table per feature, where back in the RDBMS we would have used a data mart)
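
    Here's the rough sketch of the self-join pattern I mean (table and column names are made up):

    ```sql
    -- Hypothetical wide transaction table: purchase rows, virtual-item rows
    -- and player-level rows all live in the same table, tied together by
    -- connection_id / player_id, so assembling one picture needs self joins
    SELECT p.player_id,
           p.revenue,
           v.item_id,
           l.player_level
    FROM   transactions p                       -- purchase rows
    JOIN   transactions v                       -- virtual-item rows
           ON  v.connection_id = p.connection_id
           AND v.event_type    = 'virtual_item'
    JOIN   transactions l                       -- player-level rows
           ON  l.player_id  = p.player_id
           AND l.event_type = 'level_snapshot'
    WHERE  p.event_type = 'purchase';
    ```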

  • Is this game data? You could consider dumping it all in a staging table, doing all the joins there, and then writing the finished record (assuming there is such a thing as a finished record) to your target table. You'll want to avoid updates to the target, as that's going to slow you down, especially if you're processing tons and tons of data. In Vertica you'll also want to be mindful of segmentation (unless you're on Eon mode), since that will define which node the data resides on. So, if you've got all the pieces of a transaction residing on different nodes, that could make query performance suffer. You'll probably want to segment on the key that binds everything together.
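
    A minimal sketch of what that could look like (table and column names are hypothetical):

    ```sql
    -- Target table segmented on the key that ties the pieces of a
    -- transaction together, so they all land on the same node
    -- (Enterprise mode; in Eon mode placement works through shards)
    CREATE TABLE game_events (
        connection_id INT,
        player_id     INT,
        event_type    VARCHAR(30),
        event_ts      TIMESTAMP,
        amount        NUMERIC(12,2)
    )
    SEGMENTED BY HASH(connection_id) ALL NODES;

    -- Do the joins in a staging table, then append only finished records
    -- to the target (insert-only, no updates)
    INSERT /*+ DIRECT */ INTO game_events
    SELECT connection_id, player_id, event_type, event_ts, amount
    FROM   game_events_staging;
    ```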

  • markus_zhang Community Edition User

    @Vertica_Curtis Thanks for the reply!

  • edited August 2023

    @Vertica_Curtis said:
    Is this game data? You could consider dumping it all in a staging table, doing all the joins there, and then writing the finished record (assuming there is such a thing as a finished record) to your target table. You'll want to avoid updates to the target, as that's going to slow you down, especially if you're processing tons and tons of data. In Vertica you'll also want to be mindful of segmentation (unless you're on Eon mode), since that will define which node the data resides on. So, if you've got all the pieces of a transaction residing on different nodes, that could make query performance suffer. You'll probably want to segment on the key that binds everything together.

    What if a Vertica user wants more than this?

  • marcothesane Administrator

    In general - form follows function. So I'd need to know the nature of the "more" you're mentioning.

    Without any additional info, the safest way is to:

    • model star schemas rather than snowflakes
    • use Vertica's flattened tables feature to keep redundant copies of the descriptive dimension columns most used for grouping and filtering the fact table's data (the usual suspects, as I call them)
    • use integers as keys to the dimensions
    • model Slowly Changing Dimensions using surrogate keys, combining Type 1s and Type 2s in one table; for Types 1 and 2, time precision DATE; for Types 4, 5, and 6, the same type of table, but precision TIMESTAMP(0)
    • partition big tables, create background tables with the same structure, insert-only into those background tables (from staging and from the target), and then swap partitions to avoid delete vectors (see the sketch below this list)
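
    To illustrate that last point, here is a rough sketch of the partition-swap workflow. The table and column names are hypothetical, and it assumes fact_sales is partitioned by a month column:

    ```sql
    -- Background table with the same structure as the target
    CREATE TABLE fact_sales_bg LIKE fact_sales INCLUDING PROJECTIONS;

    -- Rebuild one month insert-only, combining existing target rows
    -- with the new rows from staging
    INSERT /*+ DIRECT */ INTO fact_sales_bg
    SELECT * FROM fact_sales    WHERE sale_month = '2023-08-01'
    UNION ALL
    SELECT * FROM staging_sales WHERE sale_month = '2023-08-01';

    -- Swap the rebuilt partition into the target in one operation:
    -- no DELETE or UPDATE on the target, hence no delete vectors
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES(
        'fact_sales_bg',   -- background table
        '2023-08-01',      -- min partition key value
        '2023-08-01',      -- max partition key value
        'fact_sales'       -- target table
    );
    ```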
