Historical Data
We are exploring ways to keep historical data. We have to do point in time analysis. So we plan to do
SELECT SET_CONFIG_PARAMETER('HistoryRetentionTime', '-1');
SELECT SET_CONFIG_PARAMETER('HistoryRetentionEpochs', '-1');
We will do trickle load to vertica from different source systems. Is it feasible to keep history for ever? How will it impact Vertica's performance? From documentation it sounds like it will impact performance. What is the best approach to keep history in vertica to do point in time analysis?
SELECT SET_CONFIG_PARAMETER('HistoryRetentionTime', '-1');
SELECT SET_CONFIG_PARAMETER('HistoryRetentionEpochs', '-1');
We will do trickle load to vertica from different source systems. Is it feasible to keep history for ever? How will it impact Vertica's performance? From documentation it sounds like it will impact performance. What is the best approach to keep history in vertica to do point in time analysis?
0
Comments
By setting -1, my understanding is that you'll disable retention both ways. Most people who do history retention don't use retention by epochs. Because that's kind of arbitrary. Epochs are advanced for every commit in your system. So, if you know how many commits you're doing, you could do it that way, but it's harder to manage. Time is more manageable for most people.
That said, setting the time value to something huge will create some performance overhead, although it is difficult to quantify. Not many people do this, which is why the default value is 0.
Historical queries allow you to query tables prior to deletes occurring, for example. Because deletes don't actually delete anything in the database immediately, you can use that to your advantage for historical queries. It does mean that delete vectors will remain in your system for the duration of the history retention time, however. So, if you set a history retention time of one week, deletes can not be purged until they are at least a week old.
Delete vectors can have performance implications. The more deletes there are, the slower a select * from table is going to work, because it has to "minus" out all the deletes from the delete vector in order to give you a current view.
Too many delete vectors can eventually cause a system to crash. I don't want to go into all those details here, but suffice it to say, it's not pretty when it happens.
The better solution here is to use a type 2 model schema for your data warehouse. Add things like "deleted_flags" to your table to determine whether they've been deleted or not, or retain versions of your data by adding a start/end date column. Now when you want to see what a piece of data looked like, you just add the date field to your query and you get the value at that time. Same functionality, no risk of database crashing. It does obviously create more complexity on the load, etc. But if you want to go back to the beginning of time with a historical query, that's going to be the best way to do it.