Transactionally reload daily table snapshots
Hi,
we are dealing with a rather common scenario: there are data sources that give us a full snapshot of their state every day, and we take those snapshots and store them in Vertica. We're on Vertica 9.2.
We have certain requirements for this process:
1. When a snapshot is replaced by a newer one, we'd like the switch to happen effectively instantly and without any unnecessary locks from the end users' perspective (the clients that actually query the tables).
2. We'd like the process to be safe in case two uploads start at the same time. We expect one of the snapshots to be applied and the other to fail.
3. We'd like a general solution that works for any such table, with as little manual work per table as possible.
Current thoughts and issues:
The most elegant way to update the table the users are seeing would be SWAP_PARTITIONS_BETWEEN_TABLES with some intermediate staging table. This pattern is also mentioned several times in the documentation as a good general approach.
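To make this concrete, something like the following is what I have in mind. I'm not sure it's exactly right, and the schema/table names, the partition column and the range values are made up for illustration:

CREATE TABLE staging.daily_snapshot LIKE public.daily_snapshot INCLUDING PROJECTIONS;
COPY staging.daily_snapshot FROM '/data/snapshot_2019-06-01.csv' DELIMITER ',' DIRECT;
-- Exchange one partition between the staging table and the table the users query
SELECT SWAP_PARTITIONS_BETWEEN_TABLES(
    'staging.daily_snapshot',   -- staging table
    '2019-06-01',               -- min partition key value
    '2019-06-01',               -- max partition key value
    'public.daily_snapshot');   -- target table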
Problems:
- The staging table cannot be temporary, because it's impossible to create a temp table LIKE a real table. So the only options are either a static staging table or a dynamically created real table.
- Tables that are not partitioned would require creating a single 'dummy' partition. (I'd guess an unpartitioned table is stored identically to a single-partition table anyway, so this just feels a bit messy.)
- There is no SWAP_ALL_PARTITIONS or SWAP_TABLES. This breaks our wish for a generic solution, because we then need to somehow iterate over an arbitrary list of partitions (a rough sketch of what I mean is below this list).
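The best workaround I can come up with for the missing SWAP_ALL_PARTITIONS is to generate one swap call per partition key present in the staging table and then execute the generated statements (e.g. via vsql). This is only a sketch with made-up names, and it is of course N separate swaps rather than a single atomic operation:

-- Generate one SWAP_PARTITIONS_BETWEEN_TABLES call per partition key in the staging table
SELECT DISTINCT
       'SELECT SWAP_PARTITIONS_BETWEEN_TABLES(''staging.daily_snapshot'', '''
       || partition_key || ''', ''' || partition_key
       || ''', ''public.daily_snapshot'');' AS swap_stmt
FROM v_monitor.partitions
WHERE table_schema = 'staging'
  AND projection_name ILIKE 'daily_snapshot%';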
We want to remove the old data and insert the new snapshot into the staging table as a single transaction. This is to prevent issues if someone accidentally starts two such processes at once. For example, if we use TRUNCATE, both concurrent loads would truncate (which commits the transaction) and then start loading their snapshots, ending up with duplicates. The ideal solution would take an X lock on the table and perform the entire remove-and-load operation in one go.
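Roughly what I mean by 'in one go' is something like the sketch below. I'm assuming here that DELETE takes an X lock on the table, so a second concurrent reload would block on the DELETE instead of interleaving with the first one; the names are again made up:

BEGIN;
-- Unlike TRUNCATE, DELETE does not commit, so the wipe and the load stay in one transaction
DELETE FROM staging.daily_snapshot;
-- NO COMMIT keeps COPY from committing on its own; everything becomes visible at COMMIT below
COPY staging.daily_snapshot FROM '/data/snapshot_2019-06-01.csv' DELIMITER ',' NO COMMIT;
COMMIT;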
Problems:
- Again, partitions seem like a good approach, using DROP PARTITION. But as I understand it, that commits just like TRUNCATE does. And again we'd have to iterate over partitions instead of saying DROP_ALL_PARTITIONS (I guess that's essentially what TRUNCATE does?).
- The only way to remove data without committing the transaction is DELETE, which is slow and doesn't suit some of our loads.
I guess a bulk delete is what we're after (according to https://www.vertica.com/kb/Best-Practices-for-Deleting-Data/Content/BestPractices/Best-Practices-for-Deleting-Data.htm#PerformBulkDelete), but there is very little documentation on how it works. Is it different from DELETE ... WHERE employee_id = 15? How would I DELETE all rows from a table?
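For reference, my current guess is that the 'bulk delete' from the KB article is simply an unqualified DELETE with the DIRECT hint, but I may be reading it wrong:

-- Deletes all rows; the DIRECT hint should write the delete vectors straight to ROS instead of WOS
DELETE /*+ DIRECT */ FROM staging.daily_snapshot;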
Any suggestions on how to approach this?
Comments
Here is one option that might cover requirements 1 and 3: