Options

Importing CSV files with text fields that contain double quotes using COPY

Hi - I've been having trouble using the COPY command to import CSV files that include double quotes within a text field.

I can get around this by creating the CSV file with the double quotes within the field TXTMD removed, but I would like to have the double quotes in the Vertica table.

COPY DataBase1.DF_EQUIPMT_TEXT
FROM LOCAL 'C:\work\double quotes test UTF8.csv'
DELIMITER ','
ENCLOSED BY '"'
NULL ''
ENFORCELENGTH
REJECTMAX 0
SKIP 1;

Here is the CSV file:
"DF_EQUIPMT","TXTMD"
"0000000010124","SWEEPER STREET: 72" MELROE"
"0000000010124","SWEEPER STREET: 72"" MELROE"
"0000000010000","ILLUMI INFR AN/PEQ-2A"
"0000000010000","TELEVISION RECEIVER HOME TYPE, 32" FLAT"
"0000000010000","TELEVISION RECEIVER HOME TYPE, 32"" FLAT"
"0000000010000","COMPUTER, PERSONAL WORKSTATION"

I've tried saving my CSV file using ANSI and UTF-8 encoding.

Here's my output using the above COPY command:
DF_EQUIPMT TXTMD


0000000010000 ILLUMI INFR AN/PEQ-2A
0000000010000 COMPUTER, PERSONAL WORKSTATION

Desired output:
DF_EQUIPMT TXTMD


0000000010124 SWEEPER STREET: 72" MELROE
0000000010124 SWEEPER STREET: 72"" MELROE
0000000010000 ILLUMI INFR AN/PEQ-2A
0000000010000 TELEVISION RECEIVER HOME TYPE, 32" FLAT
0000000010000 TELEVISION RECEIVER HOME TYPE, 32"" FLAT
0000000010000 COMPUTER, PERSONAL WORKSTATION

Any help would be greatly appreciated.

Thanks!
Trevor

Comments

  • Options
    [Deleted User][Deleted User] Administrator

    Have you tried filler columns in copy statement ?

    VMART=> create table test(c1 varchar(100), c2 varchar(100));
    CREATE TABLE
    VMART=> ! vi /tmp/test.dat
    VMART=> COPY test from '/tmp/test.dat' DELIMITER ',';

    Rows Loaded

           1
    

    (1 row)

    VMART=> select * from test;
    c1 | c2
    -----------------+------------------------------
    "0000000010124" | "SWEEPER STREET: 72" MELROE"
    (1 row)

    VMART=> COPY test(x filler varchar(100), y filler varchar(100), c1 as replace(x,'"','') , c2 as replace(y,'"','')) from '/tmp/test.dat' DELIMITER ',';

    Rows Loaded

           1
    

    (1 row)

    VMART=> select * from test;
    c1 | c2
    -----------------+------------------------------
    "0000000010124" | "SWEEPER STREET: 72" MELROE"
    0000000010124 | SWEEPER STREET: 72 MELROE
    (2 rows)

  • Options
    Ariel_CaryAriel_Cary Vertica Employee Employee

    Hi Trevor,

    Double quotes in the data is an interesting case. The general answer is that you need to escape your double-quote characters in your data. Then you can use the FCSVParser parser function telling it what your escape character is.

    (Also, note that there is no real CSV standard. There is the RFC4180 specification which does a CSV format, but different applications tend to accept variations of that CSV format. In Vertica, we support the RFC4180 type. But we also add another type that we call TRADITIONAL.)

    In the following example, the TRADITIONAL FCSVParser and a backslash ('\' ) as a escape character.

    CREATE TABLE DF_EQUIPMT_TEXT(DF_EQUIPMT varchar, TXTMD varchar);

    -- Take the data from stdin so I can inline the data, but you change it to a file source
    -- NB: You escape every " with the escaping character ('\')
    COPY DF_EQUIPMT_TEXT FROM STDIN PARSER FCSVParser(type='traditional', escape='\');
    "DF_EQUIPMT","TXTMD"`
    "0000000010124","SWEEPER STREET: 72\" MELROE"
    "0000000010124","SWEEPER STREET: 72\"\" MELROE"
    "0000000010000","ILLUMI INFR AN/PEQ-2A"
    "0000000010000","TELEVISION RECEIVER HOME TYPE, 32\" FLAT"
    "0000000010000","TELEVISION RECEIVER HOME TYPE, 32\"\" FLAT"
    "0000000010000","COMPUTER, PERSONAL WORKSTATION"
    .

    SELECT * from DF_EQUIPMT_TEXT;

    DF_EQUIPMT | TXTMD ---------------+------------------------------------------
    0000000010124 | SWEEPER STREET: 72" MELROE
    0000000010124 | SWEEPER STREET: 72"" MELROE
    0000000010000 | ILLUMI INFR AN/PEQ-2A
    0000000010000 | TELEVISION RECEIVER HOME TYPE, 32" FLAT
    0000000010000 | TELEVISION RECEIVER HOME TYPE, 32"" FLAT
    0000000010000 | COMPUTER, PERSONAL WORKSTATION
    (6 rows)

    You could also use the RFC4180 type parser. But the default escape character is a double-quote character; you cannot choose another character (from that perspective, the TRADITIONAL type is more flexible). You're data should then include double-quote twice (i.e. "") for each double-quote character in your data.

    Hope this helps.

  • Options
    edited May 2017

    @skamat - I have not tried filler columns in the COPY statement. I will try this.

    @Ariel_Cary - I will try this as well. I have been combining XLSX documents into a CSV via pandas and I can precede ever double quote with the escape character ('\') and use the FCSVParser.

    Thanks so much for your responses. I will try these both when our Vertica database is back up and running. I very much appreciate the help and have not tried these two options yet.

  • Options

    @Ariel_Cary
    I'm on Vertica Analytic Database v8.0.1-0. My error says that the function fcsvparser() does not exist when I run the COPY command below. Any suggestions? I feel like this solution should work for my CSV file. I've replaced all occurrences of double quotes with '\"'

    COPY DF_EQUIPMT_TEXT
    FROM LOCAL 'C:\work\file.csv'
              PARSER FCSVParser(type='traditional', escape='\')
              DELIMITER ','
              ENCLOSED BY '"'
              NULL ''
              ENFORCELENGTH
              REJECTMAX 0
              SKIP 1;
    

    Error
    12:19:08 [COPY - 0 rows, 0.012 secs] [Code: 3457, SQL State: 42883] [Vertica]VJDBC ERROR: Function fcsvparser() does not exist, or permission is denied for fcsvparser()
    ... 1 statement(s) executed, 0 rows affected, exec/fetch time: 0.012/0.000 sec [0 successful, 1 errors]

  • Options
    Ariel_CaryAriel_Cary Vertica Employee Employee

    You should have the FCSVParser function in your database. It has been shipped with the server since Vertica 7.x. Could you try qualifying the name with the public schema?
    public.FCsvParser

    Also, try this query to check the parser function is installed.
    SELECT * FROM user_functions WHERE function_name ILIKE 'FCsvParser';

  • Options

    @Ariel_Cary
    This works! Thanks so much!

    I was using pandas to process my csv files and for some reason the following code replaces each occurrence of double quotes with a backslash followed by TWO sets of double quotes. That's another non-Vertica related problem for me to solve, but currently have a work around.

    frame['TXTMD'] = frame['TXTMD'].str.replace('"',r'\"')

    @skamat
    Neither of the outputs you suggested below are desired. I'm looking for an output that excludes the double quotes that contains each, field, but include the double quotes within the field. So replacing all double quotes with nothing is not desired. I have not tried filler columns in a COPY statement yet, but thanks for showing me that functionality!

    VMART=> select * from test;
    c1 | c2
    -----------------+------------------------------
    "0000000010124" | "SWEEPER STREET: 72" MELROE"
    0000000010124 | SWEEPER STREET: 72 MELROE

    This would be my desired output for the first record in my csv:
    VMART=> select * from test;
    c1 | c2
    -----------------+------------------------------
    0000000010124 | SWEEPER STREET: 72" MELROE

  • Options
    [Deleted User][Deleted User] Administrator

    I agree and like solution from Ariel better.

    Filler columns are very useful when doing transformations during loads. With filler column you can achieve what you want by trimming first and last char.

  • Options

    @skamat
    Good to know about filler columns. Thanks for your help!

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file