How to add a new column to a table efficiently?
I have a large table (~1B rows, ~50 columns) on a single node (no projections, no partitioning or segmentation), and data for a new column with exactly the same keys in the same order.
My use case is to repeatedly add such new columns.
After adding an empty column to the table, I need to write the column data.
I tried using UPDATE directly, and it took many hours.
I also tried creating a temporary table and then running a MERGE. Although writing the temp table was efficient, the merge took hours, and it also required running a TupleMover mergeout before the next merge (without it, the next merge was extremely slow).
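For concreteness, the temp-table + MERGE pattern I used looked roughly like this (table, column, and path names are made up):

```sql
-- Hypothetical sketch of the load-then-MERGE approach described above.
CREATE TABLE new_col_stage (pk BIGINT NOT NULL, new_val FLOAT)
  ORDER BY pk;

-- fast bulk load of the pre-sorted column data
COPY new_col_stage FROM '/data/new_col.csv' DELIMITER ',' DIRECT;

-- merge the staged values into the big table on the shared key
MERGE INTO big_table t
USING new_col_stage s
ON t.pk = s.pk
WHEN MATCHED THEN UPDATE SET new_col = s.new_val;

-- the TupleMover mergeout I had to run before the next merge
SELECT DO_TM_TASK('mergeout');
```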
I tried joining the temp table and the original table into a new table, but that was also very slow, especially as the original table grew wider.
Is there a more efficient solution, given that I have the data sorted and with the same primary keys? Maybe a low-level solution?
Thanks,
Yossi
Best Answer
marcothesane - Administrator
Hi @yossibm
Documentation on the DEFAULT clause in a column definition can be found here:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Statements/column-constraint.htm
Scroll to "DEFAULT" in the column constraint tabular list.
So you need to derive the content of your new column from sources outside of the current table.
Then you need to check this part - "Flattened Tables" in the Vertica documentation:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/FlattenedTables.htm?zoom_highlight=flattened table
Here, we introduce the concept that you can also use a single-row, single-column SELECT as an expression for a column DEFAULT.
Let me put together a Mickey Mouse example:

-- this is the base table
-- as yet without the column with the complex calculation
CREATE TABLE mickey (
  id         BIGINT NOT NULL
, first_name VARCHAR(32)
, last_name  VARCHAR(32)
, hdate      DATE
);
-- out CREATE TABLE

-- insert some data ...
INSERT /*+ DIRECT */ INTO mickey(id,first_name,last_name,hdate)
SELECT  1,'Arthur','Dent'            ,DATE '2010-04-01' UNION ALL
SELECT  2,'Ford','Prefect'           ,DATE '2011-04-01' UNION ALL
SELECT  3,'Zaphod','Beeblebrox'      ,DATE '2012-04-01' UNION ALL
SELECT  4,'Tricia','McMillan'        ,DATE '2013-04-01' UNION ALL
SELECT  5,'Gag','Halfrunt'           ,DATE '2014-04-01' UNION ALL
SELECT  6,'Prostetnic Vogon','Jeltz' ,DATE '2015-04-01' UNION ALL
SELECT  7,'Lionel','Prosser'         ,DATE '2016-04-01' UNION ALL
SELECT  8,'Benji','Mouse'            ,DATE '2017-04-01' UNION ALL
SELECT  9,'Frankie','Mouse'          ,DATE '2018-04-01' UNION ALL
SELECT 10,'Svlad','Cjelli'           ,DATE '2019-04-01'
;
-- out  OUTPUT
-- out --------
-- out      10

-- this is the helper table
-- to contain the complex-calculated value
-- needed finally in the "mickey" table
CREATE TABLE helper (
  id            INT
, months_worked INT
);
-- out CREATE TABLE

-- this is to mimic the complex ETL calculation
INSERT INTO helper
SELECT id
     , MONTHS_BETWEEN(CURRENT_DATE,hdate)
FROM mickey;
-- out  OUTPUT
-- out --------
-- out      10

COMMIT;
-- out COMMIT

-- check content of helper table
SELECT * FROM helper;
-- out  id | months_worked
-- out ----+---------------
-- out   1 |           121
-- out   2 |           109
-- out   3 |            97
-- out   4 |            85
-- out   5 |            73
-- out   6 |            61
-- out   7 |            49
-- out   8 |            37
-- out   9 |            25
-- out  10 |            13
-- out (10 rows)

-- Now - fix the "mickey" table:
ALTER TABLE mickey ADD months_worked INTEGER
  DEFAULT (SELECT months_worked FROM helper WHERE id=mickey.id);
-- out ALTER TABLE

-- check the contents of the "mickey" table now ...
SELECT * FROM mickey;
-- out  id | first_name       | last_name  | hdate      | months_worked
-- out ----+------------------+------------+------------+---------------
-- out   1 | Arthur           | Dent       | 2010-04-01 |           121
-- out   2 | Ford             | Prefect    | 2011-04-01 |           109
-- out   3 | Zaphod           | Beeblebrox | 2012-04-01 |            97
-- out   4 | Tricia           | McMillan   | 2013-04-01 |            85
-- out   5 | Gag              | Halfrunt   | 2014-04-01 |            73
-- out   6 | Prostetnic Vogon | Jeltz      | 2015-04-01 |            61
-- out   7 | Lionel           | Prosser    | 2016-04-01 |            49
-- out   8 | Benji            | Mouse      | 2017-04-01 |            37
-- out   9 | Frankie          | Mouse      | 2018-04-01 |            25
-- out  10 | Svlad            | Cjelli     | 2019-04-01 |            13
-- out (10 rows)
-- out Time: First fetch (0 rows): 120.181 ms. All rows formatted: 120.244 ms
This should work, shouldn't it?
Answers
Hi @yossibm -
Why don't you try adding the column with a DEFAULT expression computed from the existing columns, instead of adding the new column first and updating it afterwards?
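A minimal sketch of that idea, assuming the new value can be derived from existing columns (table name, column name, and expression are made up for illustration):

```sql
-- Add and populate the column in a single step via a DEFAULT
-- expression over existing columns (hypothetical names/expression).
ALTER TABLE big_table
  ADD COLUMN months_worked INTEGER
  DEFAULT MONTHS_BETWEEN(CURRENT_DATE, hdate);
```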
I have always regarded this possibility as one of the real perks of a columnar database ...
Good luck
Marco
Hi,
Thanks for the suggestion.
Unfortunately I can't use something like that because the data is being generated by another process and cannot be calculated directly from existing columns.
Also, I haven't found that syntax in the documentation.
Kind regards,
Yossi
Hi,
Thank you for that comprehensive solution!
The example works for me, but I noticed that the original "helper" table is referenced by the "mickey" table.
In the past, I tried using SELECT with a JOIN to avoid the slow merge/update operation,
but the JOIN made the SELECT runtime very unpredictable: some queries took hours, without any obvious reason.
Can I trust this method to be more efficient and predictable than using a JOIN?
Is it possible to copy the column contents, so as to avoid the reference?
Thanks and best regards,
Yossi
The Join is efficient when both tables are segmented and ordered by the join column.
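A hedged sketch of such a projection pair for the example tables above (projection names are made up; assuming id is the join key on both sides):

```sql
-- Hypothetical projections ordered and segmented on the join key "id",
-- so the join can run as a co-located merge join.
CREATE PROJECTION mickey_join_p AS
  SELECT * FROM mickey
  ORDER BY id
  SEGMENTED BY HASH(id) ALL NODES;

CREATE PROJECTION helper_join_p AS
  SELECT * FROM helper
  ORDER BY id
  SEGMENTED BY HASH(id) ALL NODES;

-- populate the new projections
SELECT REFRESH('mickey, helper');
```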
If the super-projection is not already designed in the right way (you can check that by running:
SELECT EXPORT_OBJECTS('','mickey',FALSE);
), then create projections on both sides that support the join optimally.
You might also want to run ANALYZE_STATISTICS on both tables before you add that column:
SELECT ANALYZE_STATISTICS('mickey');
SELECT ANALYZE_STATISTICS('helper');

After preliminary testing, it looks like it is working without overhead.
Thank you @marcothesane !