How to add a new column to a table efficiently?

yossibmyossibm Vertica Customer

I have a large table (~1B rows, ~50 columns) on a single node (no projections, no partition or segmentation), and data for a new column with the same exact keys and same order.
My use case is to repeatedly add such new columns.
After adding an empty column to the table, I need to write the column data.
I have tried to use Update directly and it took many hours.
I have tried to create a new temp table and then do a Merge. Although writing was efficient, the merge took hours and it also required to run a TupleMover mergeout before doing the next merge (without doing so, the next merge would be extremely slow).
I have tried to Join the temp table and the original table into a new table, but it was also very slow, especially after the original table grew wider.
Is there another more efficient solution - regarding that I have the data sorted and with the same primary keys? maybe a low level solution?
Thanks,
Yossi

Best Answer

Answers

  • marcothesanemarcothesane - Select Field - Administrator

    Hi @yossibm -
    Why don't you try something like:

    ALTER TABLE mytable ADD monthgap INTEGER DEFAULT MONTHS_BETWEEN(end_date,start_date);
    

    , instead of adding the new column first and updating it after?
    I have always regarded this possibility as one of the real perks of a columnar database ...
    Good luck
    Marco

  • yossibmyossibm Vertica Customer

    Hi,
    Thanks for the suggestion.
    Unfortunately I can't use something like that because the data is being generated by another process and cannot be calculated directly from existing columns.
    Also, I haven't found that syntax in the documentation.
    Kind regards,
    Yossi

  • yossibmyossibm Vertica Customer

    Hi,
    Thank you for that comprehensive solution!
    The example works for me, but I noticed that the original "helper" table is referenced by the "mickey" table.
    I have tried in the past to use Select with Join, to avoid the slow merge/update operation,
    but the Join caused the Select runtime to be very unpredictable: some queries even took hours, without any obvious reason.
    Can I trust that method to be more efficient / predictable than using Join ?
    Is it possible to copy the column content as to avoid the reference?
    Thanks and best regards,
    Yossi

  • marcothesanemarcothesane - Select Field - Administrator

    The Join is efficient when both tables are segmented and ordered by the join column.
    If the super-projection is not already designed in the right way (and you can check that by running: SELECT EXPORT_OBJECTS('','mickey',FALSE); ) , then create projections on both sides that support the join optimally:

    CREATE PROJECTION mickey_join_helper
    AS SELECT * FROM mickey
    ORDER BY id
    SEGMENTED BY HASH(id) ALL NODES;
    CREATE PROJECTION helper_join_mickey
    AS SELECT * FROM helper
    ORDER BY id
    SEGMENTED BY HASH(id) ALL NODES;
    

    You might also want to run SELECT ANALYZE_STATISTICS() on both tables before you add that column.

  • yossibmyossibm Vertica Customer

    After preliminary testing it looks like it is working without overhead.
    Thank you @marcothesane !

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file