Preseve Empty String when using FCSV Parser

mark_d_drakemark_d_drake Community Edition User
edited May 2021 in General Discussion

If I have a CSV file with an empty string
eg

"XXX","","YYY"

and use the following options with COPY:

PARSER fcsvparser(type='rfc4180', header=false, trim=false) NULL ''`,

I get a NULL instead of an empty string for column 2. Is there a setting that will preserve the EMPTY string ?

Answers

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Maybe a FILLER?

    verticademos=> \d mark;
                                       List of Fields by Tables
     Schema | Table | Column |    Type     | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+--------+-------------+------+---------+----------+-------------+-------------
     public | mark  | c1     | varchar(80) |   80 |         | f        | f           |
     public | mark  | c2     | varchar(80) |   80 |         | f        | f           |
     public | mark  | c3     | varchar(80) |   80 |         | f        | f           |
    (3 rows)
    
    verticademos=> \! cat /home/dbadmin/mark.txt
    "XXX","","YYY"
    
    verticademos=> COPY mark (c1, c2_f FILLER VARCHAR, c2 AS NVL(c2_f, ''), c3) FROM '/home/dbadmin/mark.txt' PARSER fcsvparser(type='rfc4180', header=false, trim=false);
     Rows Loaded
    -------------
               1
    (1 row)
    
    verticademos=> SELECT * FROM mark WHERE c2 = '';
     c1  | c2 | c3
    -----+----+-----
     XXX |    | YYY
    (1 row)
    
  • mark_d_drakemark_d_drake Community Edition User

    Jim
    Thanks for the response, I'll give it a go. I am writing a general purpose tool https://github.com/markddrake/YADAMU---Yet-Another-DAta-Migration-Utilitywhich will generate these commands. How expensive would this be, assuming i Have to apply it to all VARCHAR columns (which would be the majority of the columns loaded). Eg is it worth me adding a switch to allow people to only enable preserving zero length strings if it is important to them.

    Also, since trim=false preserves leading and trailing whitespace wouldn't it be logical for it to also treat "" as zero length sting, particularly as ,, is explicitly mapped to NULL.

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited May 2021

    @mark_d_drake - That's cool that you are writing a migration utility!

    About the NULL / Empty String - Are you sure that folks actually don't want NULL? In databases like Oracle, NULL and Empty String are treated as the same thing. But in Vertica, they are different.

    verticademos=> SELECT NULL <=> NULL "Are they the same in Vertica? (Yup!)";
     Are they the same in Vertica? (Yup!)
    --------------------------------------
     t
    (1 row)
    
    verticademos=> SELECT NULL <=> '' "Are they the same in Vertica? (Nope!)";
     Are they the same in Vertica? (Nope!)
    ---------------------------------------
     f
    (1 row)
    
    

    About the Vertica COPY, note that the fcsvparser in much slower than using the default delimited parser.

    So, in your example, you can use the default parser like this:

    verticademos=> COPY mark (c1, c2_filler FILLER VARCHAR, c2 AS NVL(c2_filler, ''), c3) FROM '/home/dbadmin/mark.txt' ENCLOSED BY '"' DELIMITER ',';
     Rows Loaded
    -------------
               1
    (1 row)
    
    verticademos=> SELECT * FROM mark WHERE c2 = '';
     c1  | c2 | c3
    -----+----+-----
     XXX |    | YYY
    (1 row)
    
  • mark_d_drakemark_d_drake Community Edition User

    Jim
    AFAIK (20 +years as a PM at Oracle), Oracle is the only RDBMS that converts empty string to NULL for reasons that are either historical or hysterical depending on your perspective.
    As I'm sure you are aware the answer to your question is probably.. Eg if you put 50 customers in a room and ask the question 20 will want one behavior, 20 the other and 10 either won't understand the question or won't care but they will all have a very strong opinion of what the default should be...
    The reason I chose CSV is that I use the same code to generate CSV file for ingestion by different databases, Also using CSV seemed a little more 'professional' since it is (in theory) some what well defined by the rfc4180. The other problem with delimited data is that, as a general purpose tool I have absolutely no idea what characters might exist in the source data..

  • mark_d_drakemark_d_drake Community Edition User
    edited May 2021

    Jim will your solution differentiate between "XXX",,"YYY" and "XXX","","YYY"
    E.G. The horrible situation where some of the source columns contained NULL and others contained the EMPTY string.

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited May 2021

    No. You will have to handle that manually. Maybe with a DECODE?

    Example:

    verticademos=> TRUNCATE TABLE mark;
    TRUNCATE TABLE
    
    verticademos=> \d mark
    \                                   List of Fields by Tables
     Schema | Table | Column |    Type     | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+--------+-------------+------+---------+----------+-------------+-------------
     public | mark  | c1     | varchar(80) |   80 |         | f        | f           |
     public | mark  | c2     | varchar(80) |   80 |         | f        | f           |
     public | mark  | c3     | varchar(80) |   80 |         | f        | f           |
    (3 rows)
    
    verticademos=> \! cat /home/dbadmin/mark.txt
    "XXX","","YYY"
    "XXX","''","YYY"
    
    verticademos=> COPY mark (c1, c2_filler FILLER VARCHAR, c2 AS NVL(DECODE(c2_filler, '''''', '', c2_filler), ''), c3) FROM '/home/dbadmin/mark.txt' ENCLOSED BY '"' DELIMITER ',';
     Rows Loaded
    -------------
               2
    (1 row)
    
    verticademos=> SELECT * FROM mark;
     c1  | c2 | c3
    -----+----+-----
     XXX |    | YYY
     XXX |    | YYY
    (2 rows)
    
    verticademos=> SELECT * FROM mark WHERE c2 = '';
     c1  | c2 | c3
    -----+----+-----
     XXX |    | YYY
     XXX |    | YYY
    (2 rows)
    

    But, keep in mind, someone might acually want to load the two single quotes as data into a table...

  • mark_d_drakemark_d_drake Community Edition User
    edited May 2021

    Jim
    I need to stick with CSV, for exactly that reason, I don't know what might be in the table, so I really need a solution where I can differentiate between, and preserve both, NULL and EMPTY string should they occur in different rows of the same table. I can (although it would have a performance overhead) scan the data as I'm preparing the CSV files and split the dataset into 2 parts (those with NULL and those with EMPTY String), but I don't think that really solves the problem as if I then extend the problem to 2 columns I have 4 possible datafiles, and with 3 columns I have 8 etc...

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file