Regular ETL and merge best practices

I'm using an Oracle database with a few hundred tables I want to sync (probably hourly) into Vertica.

I have written a generic syncing tool in Groovy/Java that can do the read/insert inline in JDBC and for appends this is working well.

I'm not sure the best approach for updates. Deletes I don't need at the moment (hopefully never need). One thought was to use an intermediate table to insert everything into then execute a merge statement.

I don't know if creating/dropping 100 temp tables every hours is good practice with Vertica or if this is at all a good idea. I'd like some feedback before I invest more time in that solution.

Comments

  • Hi James,

    I'll let others here speak from actual experience.  But from an architectural standpoint, creating lots of temp tables is exactly how Vertica intends for you to solve this sort of ELT problem.  Load messy data into some temporary space; clean it up; then merge the clean data into your primary tables.

    Vertica is much better at creating and dropping tables than it is at updating and deleting large sets of rows within a table.  In terms of operational costs, creating and dropping 100 tables is like creating and deleting 100 temporary files as part of an ETL process -- they use up a little space while they're there and it costs a few disk I/O's to create them and write to / read from them; but once you delete them, they're gone.

    You might want to look into using LOCAL TEMP tables.  They are accessible only from the current session; we don't have to publish them as globally-visible objects, so creating them is a little cheaper.

    An alternative is to expose the new data via an external table and do a merge directly from the external table.  But that requires setting up an external table, which may not be straightforward in your particular use case.

    Adam
  • Adam,

    Do you have any best practices document for ETL design, Project design, Analytics using Vertica?  or is it available any where in the Vertica site?

    Thanks
    Mrao
  • To follow-up, the tool I am using is now on github https://github.com/japettyjohn/simpleETL.

    As to documents on ETL design etc. for Vertica - I would look at the manual https://my.vertica.com/docs/7.0.x/HTML/index.htm under the Concepts Guide (for how to use Vertica the 'vertica way') and Administrator's guide to understand the implication of data loading/purging.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file