COPY command with multi-character delimiters

Hi,

 

I have data in the below mentioned format:

 

1234567^~^1234567^~^1234567^~^2004-02-02

 

 

The file has multiple rows of data in the same format.

 

Can you tell me how to use it with COPY command as the Vertica allows only a SINGLE character delimiter.

 

I cant use a file parser as the size of the file is huge and would take immense time to parse.

 

 

Any inputs would be highly appreciated.

 

Thanks

Comments

  • Create your data file.

    (dbadmin@:5433) [dbadmin] > \! cat /tmp/file.dat 
    1234567^~^1234567^~^1234567^~^2004-02-02

    Create the table

    (dbadmin@:5433) [dbadmin] > \d test
    List of Fields by Tables
    Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+--------+-------------+------+---------+----------+-------------+-------------
    public | test | col1 | varchar(20) | 20 | | f | f |
    public | test | col2 | varchar(20) | 20 | | f | f |
    public | test | col3 | varchar(20) | 20 | | f | f |
    public | test | col4 | varchar(20) | 20 | | f | f |
    (4 rows)

    Create the copy commnad

    copy test(
    v1 filler varchar(20),
    v2 filler varchar(20),
    v3 filler varchar(20),
    v4 filler varchar(20),
    col1 as TRIM(BOTH '^' FROM v1) ,
    col2 as TRIM(BOTH '^' FROM v2),
    col3 as TRIM(BOTH '^' FROM v3),
    col4 as TRIM(BOTH '^' FROM v4) ) from '/tmp/file.dat' delimiter '~' direct;

    - you just need to use one single character delimiter, when loading , but you can transform the rest using fillers and the TRIM build-in function.

     

    Check that data 

    (dbadmin@:5433) [dbadmin] > select * from test;
    col1 | col2 | col3 | col4
    ---------+---------+---------+------------
    1234567 | 1234567 | 1234567 | 2004-02-02
    (1 row)

     

    I hope this was what you are looking for :) 

  •  

    Thanks Adrian_Oprea_1 !! Your solution worked perfectly fine for above example. Using fillers was a great idea. 

     

    We are purposely using multi-character delimiters (^~^) as the columns may have data that can contain tilda (~). However, there is no support for multi-character delimiter in Vertica.

     

    Is there anything using which we can ignore tilda from the coumn data. 

     

    Sample:

    123~4567^~^1234567^~^1234567^~^2004-02-04

     

    Looking forward for your response. 

  • I don't understand your question ? 

    could you rephrase it ?

  • Hi Adrian,

     

    What I mean to ask you is that, there is a possibility that the data might contain special characters. And if one of these special characters matches the delimiter then that ENTIRE ROW goes under Rejected data.

     

    So can you tell me how to handle this scenario i.e. what if the delimiter is a part of the data??

     

     

    Sample:

    123~4567^~^1234567^~^1234567^~^2004-02-04

     

     

    Hope this makes it clear. Thanks.

  • Ohhh, 

    I faced this in the past, i had to sweet a bit to fix it , but i could not do it dynamicly i had ot do it by hand. 

    I sudgesst to solve this at the "root" where the data is generated and make sure that you have specific delimiters in place.

  • Well, I too faced the same issue but unfortunately I dont have the authority to alter the data content. ;-) 

     

    So after a little ground work, we built a Parser, which could parse multi-character delimiters. And then after you deploy it. you can use it in the COPY commands as a parameter and it works perfectly.

     

     

    Thanks again for the help Adrian. 

  • Could you share it with the community ? 

  • marcothesanemarcothesane - Select Field - Administrator

     

    Hi tangiblegeek -

    If you can influence the creation of the load file - consider using string delimiters/enclosers:

    Instead of:

    123~4567^~^1234567^~^1234567^~^2004-02-04

    you would have:

    '123~4567'~1234567~1234567~'2004-02-04'

    And your COPY command would be:

     

    COPY <mytable> FROM LOCAL '<mytable.tsv>' DELIMITER '~' ENCLOSED BY '''' DIRECT;

     

    If the string delimiter can occur within the string, it would need to be escaped by a backslash ...

    Good luck

    Marco

    (I work for HP) ... 

  • Hi Marco, 

     

    Unfortunately, I dont have the access to alter the data, but there is fix that we have put eventually and it has been working pretty fine till now, so decided to share that with the community.

     

     

    We are using a shell script, and using 'awk' and 'sed' commands we replace the multi-char delimiter with a single delimiter while loading. Hence, the file gets loaded perfiectly also, the file also does not get modified. 

     

    The flow is as follows:

    1. UDX calls shell script

    2. Newly constructed COPY statement gets fired at command prompt

    3. Delimiter gets replaced 'on-the-fly' (actual file is still the same)

    4. Shell returns control to UDX for remaining steps

     

    Hope this works for you guys.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file