How to load data from a file where the column position in that file keeps on changing?

Siddarth · December 2013

How to load data from a file where the column position in that file keeps on changing. How to map a column in a file to the correct column in the corresponding table?

For example:
Table: foo(col1,col2,col3,col4)

Now load the data from a file where column position in that file keeps on changing.

File1: col1,col2,col3,col4
File2: col2,col3,col4,col1
File3: col3,col4,col1,col2

and so on...

How to map correct column from file to table?

Daniel_Leybovic · December 2013

Example

--
-- file 1
--
dbadmin@twingo:/tmp$ cat file_01.csv 
1|a|2000-01-01
2|b|2000-01-02


--
-- file 2
--
dbadmin@twingo:/tmp$ cat file_02.csv 
3|2001-01-01|x
4|2001-01-02|y


--
-- create table
--
dbadmin=> create table Siddarth (col1 int, col2 char, col3 date);
CREATE TABLE


--
-- load file 1
--
dbadmin=> copy Siddarth from '/tmp/file_01.csv' direct abort on error;
 Rows Loaded 
-------------
           2
(1 row)


--
-- load file 2
--
dbadmin=> copy Siddarth(col1, col3, col2) from '/tmp/file_02.csv' direct abort on error;
 Rows Loaded 
-------------
           2
(1 row)


--
-- show results
--
dbadmin=> select * from Siddarth ;
 col1 | col2 |    col3    
------+------+------------
    1 | a    | 2000-01-01
    2 | b    | 2000-01-02
    3 | x    | 2001-01-01
    4 | y    | 2001-01-02
(4 rows)

Siddarth · December 2013

Thanks Daniel but I am looking for a generic solution.

Can we use one single COPY command for above 2 cases you mentioned?

The number of columns in data files that we have to load is fixed but not the columns order. We are looking for a way by which we can write a generic COPY command statement that we can use for all the combinations.

Like in case of your example can we somehow write one COPY command that solves both the scenarios you have explained.

We may have 100 datafiles to load, so for all those files we will have to write 100 copy command.

[Deleted User] · December 2013

Hi Siddarth,

How do you know what order the columns are in each file? Do they have a header row or something?

If the columns can just be in any random order and there's no programmatic way to tell what order they are in, the problem becomes somewhat harder...

Adam

Daniel_Leybovic · December 2013

Hi!

1. It's possible with some conditions with COPY statement only, but its ugly
2. You can write UDF parser https://my.vertica.com/docs/6.1.x/HTML/index.htm#17840.htm
3. To use in some decorator (script) that will process a file and will create a right COPY statement

Example (1)

daniel=> \p
copy Siddarth
(
    fcol1 filler varchar(65000),
    fcol2 filler varchar(65000),
    fcol3 filler varchar(65000),
    col1  as case 
             when regexp_like(fcol1, '\d+') then fcol1::int
             when regexp_like(fcol2, '\d+') then fcol2::int
             else fcol3::int
             end,
    col2 as case
             when regexp_like(fcol1, '[a-zA-Z]') then fcol1::char
             when regexp_like(fcol2, '[a-zA-Z]') then fcol2::char
             else fcol3::char
             end,
    col3 as case
             when regexp_like(fcol1, '\d{4}-\d{2}-\d{2}') then fcol1::date
             when regexp_like(fcol2, '\d{4}-\d{2}-\d{2}') then fcol2::date
             else fcol3::date
             end
)
from '/tmp/file_01.csv' direct abort on error;
daniel=> \g
 Rows Loaded 
-------------
           2
(1 row)

daniel=> \p
copy Siddarth
(
    fcol1 filler varchar(65000),
    fcol2 filler varchar(65000),
    fcol3 filler varchar(65000),
    col1  as case 
             when regexp_like(fcol1, '\d+') then fcol1::int
             when regexp_like(fcol2, '\d+') then fcol2::int
             else fcol3::int
             end,
    col2 as case
             when regexp_like(fcol1, '[a-zA-Z]') then fcol1::char
             when regexp_like(fcol2, '[a-zA-Z]') then fcol2::char
             else fcol3::char
             end,
    col3 as case
             when regexp_like(fcol1, '\d{4}-\d{2}-\d{2}') then fcol1::date
             when regexp_like(fcol2, '\d{4}-\d{2}-\d{2}') then fcol2::date
             else fcol3::date
             end
)
from '/tmp/file_02.csv' direct abort on error;
daniel=> \g
 Rows Loaded 
-------------
           2
(1 row)

Siddarth · December 2013

Hi Adam,

There is no way by which we can check the ordering of the columns. As per my discussion, they told me that columns may or may not have headers. Even that is not fixed. Order of the columns will be random. The only thing fixed is the column number. I don't think Vertica Copy command support any of this requirement.

Thanks Daniel I will give it a try and will get back to you on this.

Siddarth · December 2013

Thanks this solution works if we have columns of unique data types. But if we have multiple columns of same data type, it is not working.

file1
col1|col2|col3|col4|col5|col61
a2|2000-01-01|a4|15|2000-01-022|b2|2000-01-02|b4|25|2000-01-02

file2
col2|col3|col4|col5|col6|col1
a2|2000-01-01|a4|15|2000-01-02|1b2|2000-01-02|b4|15|2000-01-02|2

[Deleted User] · December 2013

Hi Siddarth,

So, if there are no headers, and the columns could be in any order, and they could be all the same data types and similar values, how can you tell the columns apart at all?

Could you give an example of another existing program that does what you're trying to do? Or describe how you would identify the columns by hand, if you don't know how you would do it?

Or are you just looking for a place to dump the data? More like file storage than a database. In that case you could always just create a table that just has VARCHAR columns (even one big VARCHAR column for the whole row), and not worry about the real data type up front. Your users could then write a SELECT statement with type casts, regular expressions, etc., to get the data into whatever form they really want.

Incidentally, Vertica 7's new (and recently-announced) FlexTables feature bundles some new COPY parsers, one of which will I believe will handle even this case, with as little information as you have about your data. Given how little information you have, though, what it will do will be basically equivalent to a bunch of VARCHAR columns. And as I'm sure Daniel will be quick to point out, Vertica 7 is brand-new; I assume you're not using it yet...

Adam

Siddarth · December 2013

For below records it is not working

dbadmin@twingo:/tmp$ cat file_01.csv
1|a|2000-01-01
2|b|2000-01-02

-- file 2
dbadmin@twingo:/tmp$ cat file_02.csv 
2001-01-01|x|3 2001-01-02|y|4

Siddarth · December 2013

Hi Adam,

Yes Adam this problem is somewhat confusing. We are creating every thing at run time from client application based on user inputs. We will get multiple files with a specific set of columns for which we already have created our facts and dimensions. But the problem is the data file which we are getting has random column order. It may have all INT or combination of all data types.

We have to restrict the structure of the file.

I have also worked on Crane as a part of a Optimizer testing team in my previous organization

.

Regards,
Siddarth Kotnala

Daniel_Leybovic · December 2013

I wrote that it's possible "with some conditions", I didn't say that it's a generic solution.

We're Moving!

Create My New Community Account Now

How to load data from a file where the column position in that file keeps on changing?

Comments

Leave a Comment