
Partitioning and Loading

mferrall Vertica Customer
edited May 2023 in General Discussion

Hi everyone,
We are loading data from HDFS into a Vertica table. The data is stored in Parquet with the following partition scheme:

- utc_day
  - utc_hour
    - source_system

source_system represents different sources of data (say, orders and clicks) with different delay characteristics. We are conforming all of these events into a single fact table schema.

My Vertica data is partitioned by utc_time, which is actually just day and hour, with the following partition grouping (a rough SQL sketch follows the list):

  • older than a year, group by year
  • older than a month, group by month
  • older than a week, group by week
  • otherwise, by hour
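
For reference, here is a rough sketch of how that grouping might be expressed with Vertica's hierarchical partitioning (ALTER TABLE ... PARTITION BY ... GROUP BY), assuming utc_time is a TIMESTAMP column on schema.fact; the exact expressions would need to match our table:

    ALTER TABLE schema.fact
      PARTITION BY DATE_TRUNC('hour', utc_time)
      GROUP BY CASE
        -- older than a year: group by year
        WHEN DATEDIFF('year', utc_time, NOW()::TIMESTAMP) >= 1
          THEN DATE_TRUNC('year', utc_time)
        -- older than a month: group by month
        WHEN DATEDIFF('month', utc_time, NOW()::TIMESTAMP) >= 1
          THEN DATE_TRUNC('month', utc_time)
        -- older than a week: group by week
        WHEN DATEDIFF('week', utc_time, NOW()::TIMESTAMP) >= 1
          THEN DATE_TRUNC('week', utc_time)
        -- otherwise keep hourly partitions
        ELSE DATE_TRUNC('hour', utc_time)
      END
      REORGANIZE;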

Let's say I'm uploading a file for clicks; the process is (sketched in SQL after the list):

  • delete from schema.fact where source_system = 'clicks' and utc_time = this_hour
  • copy the file
  • delete from schema.fact where utc_time < retention_period
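
Concretely, each load currently looks something like this (the timestamps, HDFS path, and retention interval below are placeholders):

    -- 1. Clear any rows previously loaded for this source and hour
    DELETE FROM schema.fact
     WHERE source_system = 'clicks'
       AND utc_time = '2023-05-01 13:00:00';

    -- 2. Load the hour's Parquet data from HDFS
    COPY schema.fact
      FROM 'hdfs:///events/utc_day=2023-05-01/utc_hour=13/source_system=clicks/*.parquet'
      PARQUET;

    -- 3. Enforce retention with a row-based delete
    DELETE FROM schema.fact
     WHERE utc_time < NOW() - INTERVAL '90 days';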

Furthermore, I update some partitions of data 3 days later.

Observations

  • This seems bad: it's generating a bunch of delete vectors, and I need projections tuned for deletes, which may not serve my queries
  • My hourly partitions also seem subpar; even with my groupings, I'm pushing 300 ROS containers, 6x the recommendation of 50

One alternative I'm considering is to reload all of the hour's data at each load, so we could swap or drop partitions rather than use row-based delete operations.
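
A rough sketch of that staging-and-swap pattern, using SWAP_PARTITIONS_BETWEEN_TABLES and DROP_PARTITIONS (the staging table name, paths, and timestamps are placeholders; with grouped partitions the ranges also have to line up with group boundaries or be force-split):

    -- Staging table with the same columns, projections, and partitioning
    CREATE TABLE IF NOT EXISTS schema.fact_staging LIKE schema.fact INCLUDING PROJECTIONS;
    TRUNCATE TABLE schema.fact_staging;

    -- Reload the entire hour (all source systems) into staging
    COPY schema.fact_staging
      FROM 'hdfs:///events/utc_day=2023-05-01/utc_hour=13/*/*.parquet'
      PARQUET;

    -- Atomically swap that hour's partition into the fact table
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES(
             'schema.fact_staging',
             '2023-05-01 13:00:00',  -- min partition key
             '2023-05-01 13:00:00',  -- max partition key
             'schema.fact');

    -- Retention by dropping whole partitions instead of row-based deletes
    SELECT DROP_PARTITIONS('schema.fact', '2022-01-01 00:00:00', '2022-03-31 23:00:00');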

Some questions

  • When loading data into Vertica, is it best practice to load a whole partition at a time, rather than bits of a partition?
  • When deleting data from Vertica, is drop/swap the better practice? I see that this takes an O lock; is that lock fast enough that it doesn't really matter?
  • How should I be thinking about these things alongside hourly partitions for my younger data? Does 300 seem like an excessive number of partitions?

Best Answers

  • Vertica_Curtis Employee
    Answer ✓

    300 partitions probably isn't a huge number, but fewer is certainly generally better; I tend to believe the documentation takes a very conservative stance here. Having a projection that contains just the primary key values, ordered by source_system and utc_time, might improve your delete performance by a lot.
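
    Something like this, as a minimal sketch (assuming a single primary-key column, here called event_id, which you would replace with your real key columns):

      -- Narrow projection to speed up locating rows for deletes
      CREATE PROJECTION schema.fact_delete_helper AS
        SELECT source_system, utc_time, event_id
        FROM schema.fact
        ORDER BY source_system, utc_time, event_id
        SEGMENTED BY HASH(event_id) ALL NODES;

      -- Populate the new projection
      SELECT REFRESH('schema.fact');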

    And yes, drop_partition is always more advantageous than a delete statement; it's orders of magnitude faster and doesn't create delete vectors. Though I'm not really sure it's a viable option in this case. You might also consider a daily rollup after 3 days, and then week/month/year; that would cut the number of hourly partitions in half.

    It might also be worthwhile to ask what it would look like if you didn't delete the data at all. Is there any value in keeping it around - to see what it was prior to being deleted? You could just load it all with a new timestamp. Vertica's TOP-K projections can be used to always select out the most recent record. So, instead of reading from the table, you read from the top-K projection, and that gives you an instantaneous "view" of each record's most recent value. So, queries against the table don't have to be cumbersome, and you have a lineage of all the history to boot.
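
    As a rough sketch of that write-only pattern (event_id and load_ts are hypothetical column names for the business key and the load timestamp):

      -- Keep only the most recently loaded version of each event
      CREATE PROJECTION schema.fact_latest AS
        SELECT event_id, source_system, utc_time, load_ts
        FROM schema.fact
        LIMIT 1 OVER (PARTITION BY event_id ORDER BY load_ts DESC);

      -- Read current values from the Top-K projection instead of the base table
      SELECT * FROM schema.fact_latest WHERE utc_time >= '2023-05-01 00:00:00';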

  • Vertica_Curtis Employee
    Answer ✓

    I'm not sure I fully understand the inner workings of deletes, but I've heard anecdotally that having a projection that supports them can help. Deletes are applied to all projections, and there is cross-communication between projections for this process. So, if you have a table with columns A-Z, and a projection that only has columns A-L, and you do a delete where Z = 10, that second projection doesn't even have Z, but it will still get the deletes applied to it. The super projection is used to inform that projection which rows those are, in a context the projection can understand. As long as the projection that drives this process has enough information to convey that accurately, it should still be efficient. At any rate, give it a try and see if it helps. It shouldn't need to be a super projection, but you could try it either way.

    All projections on a table share the partitioning scheme of the table. Partitioning is at the table level - you can't partition projections in a different way.

Answers

  • mferrall Vertica Customer

    @Vertica_Curtis Thank you! Does having a delete-oriented projection improve deletes on the non-delete-oriented projections as well? To me, that implies there must either be some mapping from one projection to another, or that the delete-oriented one is 'leading' in some sense, where it is updated first and the other projections follow it. Is that about correct?

    Regarding a 'write-only' approach, yes, that would be my preferred approach. It's not something we've done before, but I'll look into it. We've avoided LAPs and other more complex projections because they weren't compatible with deletes in 9.x, IIRC; I think we are on 11 now, though. Do Top-K projections share a partitioning scheme with their base projection?

    Again, thank you for the input!

  • Vertica_Curtis Employee

    There is a sort of mapping between projections, or at least some algorithm that ties them together in order to facilitate deletes. I would imagine that if a projection were problematic, it would be problematic regardless of the efficiency of other projections, but I could be wrong. You can also try the helper function "evaluate_delete_performance()". It's kind of an oddball function, but it should show you whether there could be delete concerns on a projection. It's mostly looking for very high-cardinality ORDER BY clauses, I think. Say I had a table that stored people, sorted by a unique "people_id" column, and then I created a projection that only had a handful of columns in it - hair_color and eye_color. I'd end up with a lot of duplicate "brown, brown" entries in that projection. If I deleted from the table "where people_id = 123", the secondary projection wouldn't actually know which specific row to delete in order to satisfy that criteria. Which brown-eyed, brown-haired person are you referring to here? evaluate_delete_performance() would show that yes - there are concerns with that projection, since it can't easily identify a single tuple.

    Incidentally, in that example, Vertica would literally just end up deleting a random "brown, brown" record from the second projection, because ultimately it wouldn't matter which one was removed - only that the counts were consistent!
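
    For a quick check on your own projections (the second name below is just an example):

      -- Check every projection on the table for potential delete concerns
      SELECT EVALUATE_DELETE_PERFORMANCE('schema.fact');

      -- Or check a single projection by name
      SELECT EVALUATE_DELETE_PERFORMANCE('schema.fact_delete_helper');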
