Insert select doesn't insert data in the right order

I am noticing some discrepancies with the way vertica is performing an insert select. I have a table with 3 columns with the data types defined as (identity(1), int, varchar). I am trying to insert data into this table by doing an insert (col2,col3) select col1, col2 from source_table order by col1,col2. After the insert if I run a select * from target_Table order by 1,2 the data seems to be out of order. I also tried running insert into target_Table (col2, col3) select * from (select col1,col2 from source_table) x order by x.col1, x.col2 but still no luck. This is a 3 node cluster and it seems like there were 3 sessions used to insert data into the target table as there 3 sets of caches. I checked this by repeating the exercise with the default value of the cache (250,000). I noticed there were 3 different set of values for the identity column (1,2,3) , (250001, 250002...) and (500001, 500002..). I would assume it would insert rows in one session and just increment the identity column by 1 for each row its trying to insert. FYI, I also tried to do the same thing with sequences and used nextval as the default value. Same behavior (Kinda expected but I was still trying my luck). Is this expected behavior or am I doing something wrong here?

Comments

  • You are ordering by the target sequence ids and I don't think they are expected to be in order. Being distributed ids, they maintain separate caches as you noticed and can be used independently. The real feature of the identity column is its uniqueness. I think the observed behavior is by design. If you need to preserve the same source order, you could reuse the ids from the source table by just inserting them.
  • I didn't understand the last part. How do I get them both to be in the same order? 
  • Cool! Thanks! 
  • @Colin - can you elaborate on what you mean by "you could reuse the ids from the source table by just inserting them"? I didn't see an ID column in Karan's source table?

    An alternative answer - What if the target table does not use an identity column but just an integer. And the insert statement is changed to "insert into target_table (id_col, col2, col3) select my_seq.nextval, col1, col2 from source_table order by 2, 3"

    Would that ensure the ordering Karan was looking for?

    Thanks,

    Bing
  • Hey Bing, I do have an ID column but I didn't specify it explicitly. What Colin means to say is that I can just use my ID column to simulate an Identity column. Your second question. It doesn't work either. I just tried it out.
  • Now that I did a quick test myself, I found the situation rather strange: if I did just "select" but not "insert", the ordering looks right. Once I put the "select" after the "insert" then the same problem happened again. So apparently it only manifest itself when you write to disk. Also I tried with segmented vs. nonsegmented target tables and it didn't make a difference.

    The only time it worked is when I ran the exact same test on a single-node instance.  On a multi-node cluster, even after I created the source_table with a sorted super-projection (on col1, col2), the result was still messed up.  And I tried a one-node local temporary table as target, the same deal.

    It only makes sense if you start taking data distribution into consideration. From application/development point of view,  however, this is a rather inconsistent behavior. Personally I think this is a bug. Nevertheless, glad you have a way around it.
  • It kinda makes sense to have the data split across the 3 nodes unless the data is coming from an external source. It's certainly more efficient that way for big volumes of data that vertica is designed to handle. I wish there was an option to force it to order the data at the cost of performance.
  • When using a default nextval column instead of identity, you have the flexibility of either specifying the value , or let the default nextval do it. As per example below.
    In that way you can control the behavior in the target table and are not always at the mercy of the sequencer.

    create sequence my_seq START 1;
    create sequence my_seq2 START 1;

    create table t4 (c1 INTEGER DEFAULT NEXTVAL('my_seq'), c2 int);
    create table t5 (c1 INTEGER DEFAULT NEXTVAL('my_seq2'), c2 int);

    insert into t4 (c2) values (5);
    insert into t4 (c2) values (6);
    insert into t4 (c2) values (7);
    commit;

    insert into t5 (c1,c2) select c1, c2 from t4;
    insert into t5 (c1,c2) select c1, c2 from t4;
    commit;

    dbadmin=> select * from t5;
     c1  | c2
    -----+----
     108 |  5
     109 |  6
     110 |  7
     108 |  5
     109 |  6
     110 |  7
    (6 rows)

    dbadmin=> select * from t4;
     c1  | c2
    -----+----
     108 |  5
     109 |  6
     110 |  7
    (3 rows)

    Note how the values in t5 are the same from t4 and not those generated by my_seq2.

     

    Hope this helps.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file