Partitioning and Loading
We are loading data from HDFS into a Vertica table. The data is stored in Parquet with the following partition scheme:
- utc_day
- utc_hour
- source_system
`source_system` represents different sources of data (say, orders and clicks) with different delay characteristics. We are conforming all of these events into a single fact table schema.
My Vertica table is partitioned by `utc_time`, which is really just day and hour. It uses the following partition grouping:
- older than a year: group by year
- older than a month: group by month
- older than a week: group by week
- otherwise: by hour
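For reference, a DDL sketch of that grouping (table and column names are my placeholders, not necessarily the real schema): Vertica expresses hierarchical partition grouping as a `GROUP BY` expression over the partition expression, much like the `CALENDAR_HIERARCHY_DAY` examples in the docs, but a custom `CASE` is needed to add the week and hour tiers:

```sql
-- Sketch only: table/column names are assumptions.
CREATE TABLE schema.fact (
    utc_time      TIMESTAMP NOT NULL,
    source_system VARCHAR(32) NOT NULL,
    metric        FLOAT
)
PARTITION BY DATE_TRUNC('hour', utc_time)
GROUP BY CASE
    WHEN DATEDIFF('year',  utc_time, NOW()::TIMESTAMP) >= 1
         THEN DATE_TRUNC('year',  utc_time)   -- older than a year
    WHEN DATEDIFF('month', utc_time, NOW()::TIMESTAMP) >= 1
         THEN DATE_TRUNC('month', utc_time)   -- older than a month
    WHEN DATEDIFF('week',  utc_time, NOW()::TIMESTAMP) >= 1
         THEN DATE_TRUNC('week',  utc_time)   -- older than a week
    ELSE DATE_TRUNC('hour', utc_time)         -- recent: hourly
END;
```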
Let's say I'm uploading a file for clicks; the process is:
- delete from schema.fact where source_system = 'clicks' and utc_time = this_hour
- copy the file
- delete from schema.fact where utc_time < retention_period
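Concretely, the three steps look something like this (the HDFS path and the `:this_hour` / `:retention_cutoff` placeholders are illustrative):

```sql
-- Step 1: clear this source's slice of the hour (row-based delete).
DELETE FROM schema.fact
 WHERE source_system = 'clicks'
   AND utc_time = :this_hour;

-- Step 2: load the new file. Vertica reads Parquet natively via COPY.
COPY schema.fact
FROM 'hdfs:///path/to/source_system=clicks/*.parquet' PARQUET;

-- Step 3: enforce retention, again as a row-based delete.
DELETE FROM schema.fact
 WHERE utc_time < :retention_cutoff;
```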
Furthermore, I update some partitions of the data 3 days later.
- This seems bad: it's generating a bunch of delete vectors, and I need projections tuned for deletes, which may not serve my queries.
- My hourly partitions also seem subpar; even with my groupings I'm pushing 300 ROS containers, 6x the recommendation of 50.
One alternative I'm considering is to reload all of the hour's data at each load, so we could swap or drop partitions rather than use the row-based delete operation.
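That alternative would look roughly like the following (staging table name and placeholders are mine; both tables must have identical definitions and partition expressions for the swap to work):

```sql
-- Build the full hour (all source systems) in a like-for-like staging table.
CREATE TABLE schema.fact_staging LIKE schema.fact INCLUDING PROJECTIONS;

COPY schema.fact_staging
FROM 'hdfs:///staging/path/*.parquet' PARQUET;

-- Swap whole partitions: a metadata operation, no delete vectors.
SELECT SWAP_PARTITIONS_BETWEEN_TABLES(
    'schema.fact_staging', :this_hour, :this_hour, 'schema.fact');

-- Retention could likewise become a partition drop instead of a row delete.
-- (With grouped partitions, dropping may need the force-split option.)
SELECT DROP_PARTITIONS('schema.fact', :oldest_partition, :retention_cutoff);
```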
- When loading data into vertica, is it best practice to load for the whole partition, rather than bits of a partition at a time?
- When deleting data from Vertica, is drop/swap the better practice? I see that this takes an O lock; is that lock held briefly enough that it doesn't really matter?
- How should I weigh these options against hourly partitions for my younger data? Does 300 seem like an excessive number of partitions?