We have many millions of rows per daily partition, so let's say we stop partitioning. We use partitions so we can replace the data for a single day in an atomic operation without requiring transactions. Would you say a large transaction inserting millions of rows is performant enough while the table is simultaneously being queried? We'd delete the whole day, insert millions of rows, and then commit.
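For concreteness, the delete-and-reload alternative being asked about would look something like this (a sketch only; the `events` table, `event_date` column, and file path are hypothetical):

```sql
-- Replace one day's data in a single transaction.
-- Concurrent SELECTs keep seeing the old data until COMMIT.
BEGIN;

DELETE FROM events WHERE event_date = '2016-01-10';

-- NO COMMIT keeps the load inside the open transaction,
-- so the delete and the reload become visible together
COPY events FROM '/data/restated/2016-01-10.csv' DELIMITER ',' NO COMMIT;

COMMIT;
```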
The method you mention is less suitable for Vertica, as DELETE can be very slow and can have a big impact on SELECT performance.
My understanding is that deletes in Vertica are implemented via delete vectors, which add overhead to every subsequent query until the deleted rows are purged.
Having said that, you can switch to weekly/monthly partitions and load data directly into the table. (E.g., your monthly partition key can be yearMonth, like 201601 for January 2016, and you can then drop partitions easily and very fast when you no longer need the data.)
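A minimal sketch of that scheme, assuming a hypothetical Vertica table `events` with an `event_date` column:

```sql
-- Partition by yearMonth, e.g. 201601 for January 2016
CREATE TABLE events (
    event_date DATE NOT NULL,
    payload    VARCHAR(100)
)
PARTITION BY (YEAR(event_date) * 100 + MONTH(event_date));

-- Dropping an expired month is a fast metadata-level operation,
-- far cheaper than deleting the same rows with DELETE
SELECT DROP_PARTITION('events', '201601');
```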
General guideline: prefer dropping partitions over deleting rows whenever you can.
I hope you find it useful.
Thanks for the response. We have a slightly different use case.
We're restating data for a given date range every day. The link feeding data into our system is unreliable, so when data arrives it contains historical data for the previous 10 days. We recalculate all 10 days' data and replace it in the database. We partition by day, insert the data into a separate staging table, and then swap the days into the main table. It's convenient that we can drop partitions to discard old data, but I'm more concerned about the inserts: we're inserting many millions of rows per day per table.
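That swap can be sketched with Vertica's SWAP_PARTITIONS_BETWEEN_TABLES (available in 7.2+; it requires the staging table to have the same definition and partitioning as the target, and the table/file names here are hypothetical):

```sql
-- Load the restated day into a staging table whose definition
-- and partition expression match the main table
COPY events_staging FROM '/data/restated/2016-01-10.csv' DELIMITER ',';

-- Atomically exchange that day's partition between staging and main;
-- readers never see a half-replaced day
SELECT SWAP_PARTITIONS_BETWEEN_TABLES(
    'events_staging', '2016-01-10', '2016-01-10', 'events');
```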
If we give up the ease of replacing data via partitions, we'd have to delete all the data for a given day, insert the new data, and commit. Whether we could do all 10 days at once or do each day separately is a separate question. I'm wondering whether this is better than partitioning.
As I mentioned, if you can stay with the partition approach without having too many partitions at any given time, that should be your preferred approach.
The delete-and-insert approach is less optimized in column-oriented databases (it carries many overheads: delete vectors, query response time impact, etc.).
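The delete-vector overhead can be observed directly: Vertica's `v_monitor.delete_vectors` system table shows how many deleted rows are still pending purge, and PURGE_TABLE reclaims them:

```sql
-- Each DELETE marks rows in delete vectors rather than removing them;
-- queries must filter against these vectors until a purge happens
SELECT schema_name, projection_name, deleted_row_count
FROM v_monitor.delete_vectors;

-- Explicitly purge committed deletes for a table
-- (table name is illustrative)
SELECT PURGE_TABLE('public.events');
```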
I hope this answers your question.
Is there any difference in delete performance if we're deleting a range of rows, especially if the table is sorted on the key used for the delete? For instance, imagine that the date column is the first column and we sort on it. A delete of all the rows matching a given date then removes a set of contiguous rows from the DB. Even though we're deleting many rows, does this type of delete perform better than deleting non-contiguous rows?
Yes, this will certainly be better than the alternative. Delete performance in general is affected by:
- whether the columns in the delete predicate appear early in the projections' sort order (the contiguous case you describe is the best case)
- the number of outstanding delete vectors and how often they are purged
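The contiguous case can be sketched like this, assuming a hypothetical `events` table whose projection sorts on `event_date` first:

```sql
-- With event_date leading the sort order, all rows for a given date
-- are stored contiguously within each storage container
CREATE TABLE events (
    event_date DATE NOT NULL,
    payload    VARCHAR(100)
)
ORDER BY event_date;

-- This delete touches one contiguous range per container,
-- which is the cheapest DELETE pattern in a sorted column store
DELETE FROM events WHERE event_date = '2016-01-10';
```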