Loading data into Vertica from multiple computers with data integrity
First, let me describe how we currently import data into Vertica. We import data in "batches". A batch can be of arbitrary size, for example 1 hour, 4 hours, or 1 day of processed logs, and the data is inserted into various tables (projections). Each batch must either be imported completely or fail as a whole. Our log processing is non-trivial and runs in parallel on several machines, and every machine imports its share of each batch into Vertica at the same time.
Currently our approach is:
1. before the batch import starts, we create new intermediary tables
2. we insert the data to Vertica through multiple JDBC connections
3. after the data from all machines has been inserted into Vertica (a synchronization point), we run INSERT INTO real_table SELECT * FROM intermediary_table, which effectively copies the data into the "real" tables (the ones backed by pre-join projections)
4. drop the intermediary tables
(Point 3 above runs in a single transaction that either succeeds or fails as a whole, which is how we ensure data integrity.)
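To make this concrete, here is roughly what one batch looks like on the SQL side. The table and column names below are simplified placeholders, not our actual schema:

    -- 1. create a fresh intermediary table for the batch
    CREATE TABLE logs_batch_tmp (ts TIMESTAMP, user_id INT, url VARCHAR(2048));

    -- 2. every processing machine inserts its share of the batch into
    --    logs_batch_tmp over its own JDBC connection, in parallel

    -- 3. once all machines are done, copy everything into the real table
    --    in a single transaction
    INSERT INTO logs SELECT * FROM logs_batch_tmp;
    COMMIT;

    -- 4. drop the intermediary table
    DROP TABLE logs_batch_tmp;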
The problem with these intermediary tables is that the data has to be copied twice (first into the intermediary tables and then from the intermediary tables into the real tables), which makes the inserts slower than they could be.
Is there some other approach that would allow us to guarantee data integrity, but would be faster?
What is the industry practice for doing this?
Thanks,
Grega
Comments
This is a very good question.
The easiest (but technically incorrect) solution right now is to open up multiple COPY statements with "COPY ... NO COMMIT;". Then wait until they all complete; then either COMMIT all the transactions or roll them all back.
This can, of course, fail if one of the COMMITs fails. That is very unlikely, but not impossible, if your transactions only contain COPYs. In that case you'd have to clean up after all of the other COPYs, which you can do in Vertica: make a note of the current epoch in each transaction that you want to delete, then run "DELETE FROM <table> WHERE epoch IN (...);".
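As a rough sketch of that flow (the table name, file path, and epoch values are just placeholders):

    -- on each loading connection: load the data but don't commit yet
    COPY logs_batch FROM LOCAL '/data/part_01.csv' DELIMITER '|' NO COMMIT;

    -- once every connection reports success, issue a COMMIT on each of them
    COMMIT;

    -- if one COMMIT fails, remove what the other connections already
    -- committed, using the epochs you noted for those transactions
    DELETE FROM logs_batch WHERE epoch IN (123456, 123457);
    COMMIT;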
If you have your own developers, you could build a Vertica extension that does a true transactional parallel load: Write a UDSource (one of our SDK APIs) that connects directly to your application from all nodes at once, to fetch or otherwise receive data to load. Then invoke that UDSource from a single COPY statement. Because it's a single statement on the SQL side, we guarantee that the operation will be transactional -- we will either store all the data that the UDSource gives us, or none of it. I don't think this is a very common approach among users because it requires developing some code; it's not a drop-in SQL solution. But if you're an application developer, it gives you much more flexibility and control over your application's communication with Vertica.
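For completeness, the SQL side of that approach would look something like this; the library path, factory name, and source name here are hypothetical, and the UDSource itself would be code you write against the SDK:

    -- register the compiled UDSource library and expose it as a source
    CREATE LIBRARY BatchLoadLib AS '/opt/vertica/lib/BatchLoadLib.so';
    CREATE SOURCE BatchSource AS LANGUAGE 'C++' NAME 'BatchSourceFactory' LIBRARY BatchLoadLib;

    -- a single COPY statement pulls from all nodes at once and is
    -- transactional: it stores all of the data or none of it
    COPY logs SOURCE BatchSource();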
Adam
Thanks, you've given us a clever idea. The first one (COPY ... NO COMMIT) could potentially work. A custom UDSource would not be the right way to go for us at the moment, as we essentially want to push data to Vertica from our distributed application rather than the other way around (having Vertica pull from us).
Actually, the reason for my question is that the current approach is not fast enough for us, as we need to import roughly 20 TB of data into Vertica.
@Eli: We need the pre-join projections. Our schema is hand-crafted to achieve very fast query times when joining our fact tables with dimensional tables.