How to add a new column to a table efficiently?

yossibm · April 2020

I have a large table (~1B rows, ~50 columns) on a single node (no projections, no partition or segmentation), and data for a new column with the same exact keys and same order.
My use case is to repeatedly add such new columns.
After adding an empty column to the table, I need to write the column data.
I have tried to use Update directly and it took many hours.
I have tried to create a new temp table and then do a Merge. Although writing was efficient, the merge took hours and it also required to run a TupleMover mergeout before doing the next merge (without doing so, the next merge would be extremely slow).
I have tried to Join the temp table and the original table into a new table, but it was also very slow, especially after the original table grew wider.
Is there another more efficient solution - regarding that I have the data sorted and with the same primary keys? maybe a low level solution?
Thanks,
Yossi

marcothesane · April 2020

Hi @yossibm
The first documentation on the DEFAULT clause in a column definition can be found here:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Statements/column-constraint.htm
Scroll to "DEFAULT" in the column constraint tabular list.
So you need to derive the content of your new column from sources outside of the current table.
Then you need to check this part - "Flattened Tables" in the Vertica documentation:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/FlattenedTables.htm?zoom_highlight=flattened table
Here, we introduce the concept that you can use also a single-row-single-column select as an expression for a column DEFAULT.
Let me put together a Mickey Mouse example:

-- this is the base table
-- as yet without the column with the complex calculation
CREATE TABLE mickey (
  id         BIGINT      NOT NULL
, first_name VARCHAR(32)
, last_name  VARCHAR(32)
, hdate      DATE
);
-- out CREATE TABLE

-- insert some data ...
INSERT /*+ DIRECT */ INTO mickey(id,first_name,last_name,hdate)
          SELECT  1,'Arthur','Dent'           ,DATE '2010-04-01'
UNION ALL SELECT  2,'Ford','Prefect'          ,DATE '2011-04-01'
UNION ALL SELECT  3,'Zaphod','Beeblebrox'     ,DATE '2012-04-01'
UNION ALL SELECT  4,'Tricia','McMillan'       ,DATE '2013-04-01'
UNION ALL SELECT  5,'Gag','Halfrunt'          ,DATE '2014-04-01'
UNION ALL SELECT  6,'Prostetnic Vogon','Jeltz',DATE '2015-04-01'
UNION ALL SELECT  7,'Lionel','Prosser'        ,DATE '2016-04-01'
UNION ALL SELECT  8,'Benji','Mouse'           ,DATE '2017-04-01'
UNION ALL SELECT  9,'Frankie','Mouse'         ,DATE '2018-04-01'
UNION ALL SELECT 10,'Svlad','Cjelli'          ,DATE '2019-04-01'
;
-- out  OUTPUT 
-- out --------
-- out      10

-- this is the helper table 
-- to contain the complex-calculated value 
-- needed finally in the "mickey" table
CREATE TABLE helper (
  id            INT
, months_worked INT
);
-- out CREATE TABLE

-- this is to mimick the complex ETL calculation
INSERT INTO helper
SELECT
  id
, MONTHS_BETWEEN(CURRENT_DATE,hdate)
FROM mickey;
-- out  OUTPUT 
-- out --------
-- out      10
COMMIT; 
-- out COMMIT

-- check content of helper table
SELECT * FROM helper;
-- out  id | months_worked 
-- out ----+---------------
-- out   1 |           121
-- out   2 |           109
-- out   3 |            97
-- out   4 |            85
-- out   5 |            73
-- out   6 |            61
-- out   7 |            49
-- out   8 |            37
-- out   9 |            25
-- out  10 |            13
-- out (10 rows)
-- out 

-- Now - fix the "mickey" table:
ALTER TABLE mickey
ADD months_worked INTEGER DEFAULT(SELECT months_worked FROM helper WHERE id=mickey.id);
-- out ALTER TABLE
--check the contents of the "mickey" table now ...
SELECT * FROM mickey;
-- out Time: First fetch (0 rows): 120.181 ms. All rows formatted: 120.244 ms
-- out  id |    first_name    | last_name  |   hdate    | months_worked 
-- out ----+------------------+------------+------------+---------------
-- out   1 | Arthur           | Dent       | 2010-04-01 |           121
-- out   2 | Ford             | Prefect    | 2011-04-01 |           109
-- out   3 | Zaphod           | Beeblebrox | 2012-04-01 |            97
-- out   4 | Tricia           | McMillan   | 2013-04-01 |            85
-- out   5 | Gag              | Halfrunt   | 2014-04-01 |            73
-- out   6 | Prostetnic Vogon | Jeltz      | 2015-04-01 |            61
-- out   7 | Lionel           | Prosser    | 2016-04-01 |            49
-- out   8 | Benji            | Mouse      | 2017-04-01 |            37
-- out   9 | Frankie          | Mouse      | 2018-04-01 |            25
-- out  10 | Svlad            | Cjelli     | 2019-04-01 |            13
-- out (10 rows)

This should work, shouldn't it?

marcothesane · April 2020

Hi @yossibm -
Why don't you try something like:

ALTER TABLE mytable ADD monthgap INTEGER DEFAULT MONTHS_BETWEEN(end_date,start_date);

, instead of adding the new column first and updating it after?
I have always regarded this possibility as one of the real perks of a columnar database ...
Good luck
Marco

yossibm · April 2020

Hi,
Thanks for the suggestion.
Unfortunately I can't use something like that because the data is being generated by another process and cannot be calculated directly from existing columns.
Also, I haven't found that syntax in the documentation.
Kind regards,
Yossi

yossibm · April 2020

Hi,
Thank you for that comprehensive solution!
The example works for me, but I noticed that the original "helper" table is referenced by the "mickey" table.
I have tried in the past to use Select with Join, to avoid the slow merge/update operation,
but the Join caused the Select runtime to be very unpredictable: some queries even took hours, without any obvious reason.
Can I trust that method to be more efficient / predictable than using Join ?
Is it possible to copy the column content as to avoid the reference?
Thanks and best regards,
Yossi

marcothesane · April 2020

The Join is efficient when both tables are segmented and ordered by the join column.
If the super-projection is not already designed in the right way (and you can check that by running: SELECT EXPORT_OBJECTS('','mickey',FALSE); ) , then create projections on both sides that support the join optimally:

CREATE PROJECTION mickey_join_helper
AS SELECT * FROM mickey
ORDER BY id
SEGMENTED BY HASH(id) ALL NODES;
CREATE PROJECTION helper_join_mickey
AS SELECT * FROM helper
ORDER BY id
SEGMENTED BY HASH(id) ALL NODES;

You might also want to run SELECT ANALYZE_STATISTICS() on both tables before you add that column.

yossibm · April 2020

After preliminary testing it looks like it is working without overhead.
Thank you @marcothesane !

How to add a new column to a table efficiently?

Best Answer

Answers

Leave a Comment