Insert select doesn't insert data in the right order
I am noticing some discrepancies with the way vertica is performing an insert select. I have a table with 3 columns with the data types defined as (identity(1), int, varchar). I am trying to insert data into this table by doing an insert (col2,col3) select col1, col2 from source_table order by col1,col2. After the insert if I run a select * from target_Table order by 1,2 the data seems to be out of order. I also tried running insert into target_Table (col2, col3) select * from (select col1,col2 from source_table) x order by x.col1, x.col2 but still no luck. This is a 3 node cluster and it seems like there were 3 sessions used to insert data into the target table as there 3 sets of caches. I checked this by repeating the exercise with the default value of the cache (250,000). I noticed there were 3 different set of values for the identity column (1,2,3) , (250001, 250002...) and (500001, 500002..). I would assume it would insert rows in one session and just increment the identity column by 1 for each row its trying to insert. FYI, I also tried to do the same thing with sequences and used nextval as the default value. Same behavior (Kinda expected but I was still trying my luck). Is this expected behavior or am I doing something wrong here?
0
Comments
An alternative answer - What if the target table does not use an identity column but just an integer. And the insert statement is changed to "insert into target_table (id_col, col2, col3) select my_seq.nextval, col1, col2 from source_table order by 2, 3"
Would that ensure the ordering Karan was looking for?
Thanks,
Bing
The only time it worked is when I ran the exact same test on a single-node instance. On a multi-node cluster, even after I created the source_table with a sorted super-projection (on col1, col2), the result was still messed up. And I tried a one-node local temporary table as target, the same deal.
It only makes sense if you start taking data distribution into consideration. From application/development point of view, however, this is a rather inconsistent behavior. Personally I think this is a bug. Nevertheless, glad you have a way around it.
When using a default nextval column instead of identity, you have the flexibility of either specifying the value , or let the default nextval do it. As per example below.
In that way you can control the behavior in the target table and are not always at the mercy of the sequencer.
create sequence my_seq START 1;
create sequence my_seq2 START 1;
create table t4 (c1 INTEGER DEFAULT NEXTVAL('my_seq'), c2 int);
create table t5 (c1 INTEGER DEFAULT NEXTVAL('my_seq2'), c2 int);
insert into t4 (c2) values (5);
insert into t4 (c2) values (6);
insert into t4 (c2) values (7);
commit;
insert into t5 (c1,c2) select c1, c2 from t4;
insert into t5 (c1,c2) select c1, c2 from t4;
commit;
dbadmin=> select * from t5;
c1 | c2
-----+----
108 | 5
109 | 6
110 | 7
108 | 5
109 | 6
110 | 7
(6 rows)
dbadmin=> select * from t4;
c1 | c2
-----+----
108 | 5
109 | 6
110 | 7
(3 rows)
Note how the values in t5 are the same from t4 and not those generated by my_seq2.
Hope this helps.