How to load data from a file where the column positions keep changing?

How do you load data from a file where the column positions keep changing, and how do you map each column in the file to the correct column in the corresponding table?

For example:
Table: foo(col1,col2,col3,col4)

Now load the data from files where the column order keeps changing:

File1: col1,col2,col3,col4
File2: col2,col3,col4,col1
File3: col3,col4,col1,col2

and so on...

How can the file columns be mapped to the correct table columns?

Comments

  • Example

    --
    -- file 1
    --
    dbadmin@twingo:/tmp$ cat file_01.csv
    1|a|2000-01-01
    2|b|2000-01-02


    --
    -- file 2
    --
    dbadmin@twingo:/tmp$ cat file_02.csv
    3|2001-01-01|x
    4|2001-01-02|y

    --
    -- create table
    --
    dbadmin=> create table Siddarth (col1 int, col2 char, col3 date);
    CREATE TABLE


    --
    -- load file 1
    --
    dbadmin=> copy Siddarth from '/tmp/file_01.csv' direct abort on error;
    Rows Loaded
    -------------
    2
    (1 row)


    --
    -- load file 2
    --
    dbadmin=> copy Siddarth(col1, col3, col2) from '/tmp/file_02.csv' direct abort on error;
    Rows Loaded
    -------------
    2
    (1 row)


    --
    -- show results
    --
    dbadmin=> select * from Siddarth ;
    col1 | col2 | col3
    ------+------+------------
    1 | a | 2000-01-01
    2 | b | 2000-01-02
    3 | x | 2001-01-01
    4 | y | 2001-01-02
    (4 rows)
  • Thanks Daniel but I am looking for a generic solution. 

    Can we use one single COPY command for the two cases you mentioned above?

    The number of columns in the data files we have to load is fixed, but not the column order. We are looking for a way to write one generic COPY statement that works for all the combinations.

    In your example, can we somehow write one COPY command that handles both scenarios you explained?

    We may have 100 data files to load; otherwise we would have to write 100 separate COPY commands.
  • Hi Siddarth,

    How do you know what order the columns are in each file?  Do they have a header row or something?

    If the columns can just be in any random order and there's no programmatic way to tell what order they are in, the problem becomes somewhat harder...

    Adam
  • Hi!

    1. It's possible with the COPY statement alone under some conditions, but it's ugly
    2. You can write a UDF parser https://my.vertica.com/docs/6.1.x/HTML/index.htm#17840.htm
    3. You can use a decorator (a script) that processes the file and generates the right COPY statement
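    Option (3) can be sketched in a few lines of Python. This is a hypothetical sketch, not a finished tool: it assumes each data file carries a header row whose names match the table's column names (which, as discussed further down the thread, is not guaranteed here), and it passes SKIP 1 to make COPY skip that header row.

```python
# Build a COPY statement whose column list follows the order found in the
# file's header row. Hypothetical helper: assumes a header row exists and
# that its names match the table's column names.
def build_copy_statement(table, header_line, path, delimiter="|"):
    cols = header_line.strip().split(delimiter)
    return (
        "copy {} ({}) from '{}' delimiter '{}' skip 1 direct abort on error;"
        .format(table, ", ".join(cols), path, delimiter)
    )

# File2 from the question has the order col2,col3,col4,col1:
stmt = build_copy_statement("foo", "col2|col3|col4|col1", "/tmp/file_02.csv")
print(stmt)
```

    The script would read the first line of each file, call something like the above, and execute the result through vsql, so one generic loader covers any column order, as long as a header is present.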


    Example (1)
    daniel=> \p
    copy Siddarth
    (
    fcol1 filler varchar(65000),
    fcol2 filler varchar(65000),
    fcol3 filler varchar(65000),
    col1 as case
    when regexp_like(fcol1, '\d+') then fcol1::int
    when regexp_like(fcol2, '\d+') then fcol2::int
    else fcol3::int
    end,
    col2 as case
    when regexp_like(fcol1, '[a-zA-Z]') then fcol1::char
    when regexp_like(fcol2, '[a-zA-Z]') then fcol2::char
    else fcol3::char
    end,
    col3 as case
    when regexp_like(fcol1, '\d{4}-\d{2}-\d{2}') then fcol1::date
    when regexp_like(fcol2, '\d{4}-\d{2}-\d{2}') then fcol2::date
    else fcol3::date
    end
    )
    from '/tmp/file_01.csv' direct abort on error;
    daniel=> \g
    Rows Loaded
    -------------
    2
    (1 row)
    daniel=> \p
    copy Siddarth
    (
    fcol1 filler varchar(65000),
    fcol2 filler varchar(65000),
    fcol3 filler varchar(65000),
    col1 as case
    when regexp_like(fcol1, '\d+') then fcol1::int
    when regexp_like(fcol2, '\d+') then fcol2::int
    else fcol3::int
    end,
    col2 as case
    when regexp_like(fcol1, '[a-zA-Z]') then fcol1::char
    when regexp_like(fcol2, '[a-zA-Z]') then fcol2::char
    else fcol3::char
    end,
    col3 as case
    when regexp_like(fcol1, '\d{4}-\d{2}-\d{2}') then fcol1::date
    when regexp_like(fcol2, '\d{4}-\d{2}-\d{2}') then fcol2::date
    else fcol3::date
    end
    )
    from '/tmp/file_02.csv' direct abort on error;
    daniel=> \g
    Rows Loaded
    -------------
    2
    (1 row)
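    The filler/CASE trick above can also be expressed outside the database. Below is a minimal Python sketch of the same idea, with anchored patterns; note that the unanchored '\d+' in the COPY example above would also match a date string, so the CASE ordering matters there. Like the COPY version, this only works while the column patterns are mutually exclusive.

```python
import re

# Classify each field of a row by pattern, then reorder it to
# (col1, col2, col3). Illustrative sketch only: assumes exactly one int,
# one single-letter, and one date field per row, as in the example above.
PATTERNS = {
    "col1": re.compile(r"^\d+$"),                # int
    "col2": re.compile(r"^[a-zA-Z]$"),           # single char
    "col3": re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # date
}

def reorder(row, delimiter="|"):
    fields = row.strip().split(delimiter)
    out = {}
    for field in fields:
        for name, pattern in PATTERNS.items():
            if name not in out and pattern.match(field):
                out[name] = field
                break
    return (out["col1"], out["col2"], out["col3"])

# A row from file_02.csv, order col1|col3|col2:
print(reorder("3|2001-01-01|x"))  # ('3', 'x', '2001-01-01')
```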

  • Hi Adam,

    There is no way for us to check the ordering of the columns. From my discussions, the files may or may not have headers; even that is not fixed. The order of the columns will be random. The only thing fixed is the number of columns. I don't think the Vertica COPY command supports any of this.

    Thanks Daniel I will give it a try and will get back to you on this. 
    Thanks, this solution works if the columns have unique data types, but it does not work when multiple columns share the same data type.

    file1
    col1|col2|col3|col4|col5|col6
    1|a2|2000-01-01|a4|15|2000-01-02
    2|b2|2000-01-02|b4|25|2000-01-02

    file2
    col2|col3|col4|col5|col6|col1
    a2|2000-01-01|a4|15|2000-01-02|1
    b2|2000-01-02|b4|15|2000-01-02|2
  • Hi Siddarth,

    So, if there are no headers, and the columns could be in any order, and they could be all the same data types and similar values, how can you tell the columns apart at all?

    Could you give an example of another existing program that does what you're trying to do?  Or describe how you would identify the columns by hand, if you don't know how you would do it?  

    Or are you just looking for a place to dump the data?  More like file storage than a database.  In that case you could always just create a table that just has VARCHAR columns (even one big VARCHAR column for the whole row), and not worry about the real data type up front.  Your users could then write a SELECT statement with type casts, regular expressions, etc., to get the data into whatever form they really want.

    Incidentally, Vertica 7's new (and recently announced) FlexTables feature bundles some new COPY parsers, one of which I believe will handle even this case, with as little information as you have about your data. Given how little information you have, though, what it does will be basically equivalent to a bunch of VARCHAR columns. And as I'm sure Daniel will be quick to point out, Vertica 7 is brand-new; I assume you're not using it yet...

    Adam
    For the records below it is not working

    dbadmin@twingo:/tmp$ cat file_01.csv
    1|a|2000-01-01
    2|b|2000-01-02
    -- file 2
    dbadmin@twingo:/tmp$ cat file_02.csv
    2001-01-01|x|3
    2001-01-02|y|4
  • Hi Adam,

    Yes Adam, this problem is somewhat confusing. We are creating everything at run time from the client application based on user inputs. We will get multiple files with a specific set of columns for which we have already created our facts and dimensions. The problem is that the data files we are getting have a random column order; they may be all INT, or a combination of all data types.

    We have to restrict the structure of the file.

    I have also worked on Crane as part of an Optimizer testing team in my previous organization :).

    Regards,
    Siddarth Kotnala
I wrote that it's possible "with some conditions"; I didn't say it's a generic solution.
