COPY command with multi-character delimiters

tangiblegeek · May 2015

Hi,

I have data in the below mentioned format:

1234567^~^1234567^~^1234567^~^2004-02-02

The file has multiple rows of data in the same format.

Can you tell me how to use it with COPY command as the Vertica allows only a SINGLE character delimiter.

I cant use a file parser as the size of the file is huge and would take immense time to parse.

Any inputs would be highly appreciated.

Thanks

Adrian_Oprea_1 · May 2015

Create your data file.

(dbadmin@:5433) [dbadmin] > \! cat /tmp/file.dat 
1234567^~^1234567^~^1234567^~^2004-02-02

Create the table

(dbadmin@:5433) [dbadmin] > \d test
                                   List of Fields by Tables
 Schema | Table | Column |    Type     | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+--------+-------------+------+---------+----------+-------------+-------------
 public | test  | col1   | varchar(20) |   20 |         | f        | f           |
 public | test  | col2   | varchar(20) |   20 |         | f        | f           |
 public | test  | col3   | varchar(20) |   20 |         | f        | f           |
 public | test  | col4   | varchar(20) |   20 |         | f        | f           |
(4 rows)

Create the copy commnad

copy test(
v1 filler varchar(20), 
v2 filler varchar(20),
v3 filler varchar(20),
v4 filler varchar(20),
col1 as TRIM(BOTH '^' FROM v1) ,
col2 as TRIM(BOTH '^' FROM v2),
col3 as TRIM(BOTH '^' FROM v3),
col4 as TRIM(BOTH '^' FROM v4) ) from '/tmp/file.dat' delimiter '~' direct;

- you just need to use one single character delimiter, when loading , but you can transform the rest using fillers and the TRIM build-in function.

Check that data

(dbadmin@:5433) [dbadmin] > select * from test;
  col1   |  col2   |  col3   |    col4
---------+---------+---------+------------
 1234567 | 1234567 | 1234567 | 2004-02-02
(1 row)

I hope this was what you are looking for

tangiblegeek · May 2015

Thanks Adrian_Oprea_1 !! Your solution worked perfectly fine for above example. Using fillers was a great idea.

We are purposely using multi-character delimiters (^~^) as the columns may have data that can contain tilda (~). However, there is no support for multi-character delimiter in Vertica.

Is there anything using which we can ignore tilda from the coumn data.

Sample:

123~4567^~^1234567^~^1234567^~^2004-02-04

Looking forward for your response.

Adrian_Oprea_1 · May 2015

I don't understand your question ?

could you rephrase it ?

tangiblegeek · May 2015

Hi Adrian,

What I mean to ask you is that, there is a possibility that the data might contain special characters. And if one of these special characters matches the delimiter then that ENTIRE ROW goes under Rejected data.

So can you tell me how to handle this scenario i.e. what if the delimiter is a part of the data??

Sample:

123~4567^~^1234567^~^1234567^~^2004-02-04

Hope this makes it clear. Thanks.

Adrian_Oprea_1 · May 2015

Ohhh,

I faced this in the past, i had to sweet a bit to fix it , but i could not do it dynamicly i had ot do it by hand.

I sudgesst to solve this at the "root" where the data is generated and make sure that you have specific delimiters in place.

tangiblegeek · June 2015

Well, I too faced the same issue but unfortunately I dont have the authority to alter the data content. ;-)

So after a little ground work, we built a Parser, which could parse multi-character delimiters. And then after you deploy it. you can use it in the COPY commands as a parameter and it works perfectly.

Thanks again for the help Adrian.

Adrian_Oprea_1 · June 2015

Could you share it with the community ?

marcothesane · June 2015

Hi tangiblegeek -

If you can influence the creation of the load file - consider using string delimiters/enclosers:

Instead of:

123~4567^~^1234567^~^1234567^~^2004-02-04

you would have:

'123~4567'~1234567~1234567~'2004-02-04'

And your COPY command would be:

COPY <mytable> FROM LOCAL '<mytable.tsv>' DELIMITER '~' ENCLOSED BY '''' DIRECT;

If the string delimiter can occur within the string, it would need to be escaped by a backslash ...

Good luck

Marco

(I work for HP) ...

tangiblegeek · June 2016

Hi Marco,

Unfortunately, I dont have the access to alter the data, but there is fix that we have put eventually and it has been working pretty fine till now, so decided to share that with the community.

We are using a shell script, and using 'awk' and 'sed' commands we replace the multi-char delimiter with a single delimiter while loading. Hence, the file gets loaded perfiectly also, the file also does not get modified.

The flow is as follows:

1. UDX calls shell script

2. Newly constructed COPY statement gets fired at command prompt

3. Delimiter gets replaced 'on-the-fly' (actual file is still the same)

4. Shell returns control to UDX for remaining steps

Hope this works for you guys.

We're Moving!

Create My New Community Account Now

COPY command with multi-character delimiters

Comments

Leave a Comment