Importing CSV files with text fields that contain double quotes using COPY
Hi - I've been having trouble using the COPY command to import CSV files that include double quotes within a text field.
I can get around this by creating the CSV file with the double quotes within the field TXTMD removed, but I would like to have the double quotes in the Vertica table.
COPY DataBase1.DF_EQUIPMT_TEXT
FROM LOCAL 'C:\work\double quotes test UTF8.csv'
DELIMITER ','
ENCLOSED BY '"'
NULL ''
ENFORCELENGTH
REJECTMAX 0
SKIP 1;
Here is the CSV file:
"DF_EQUIPMT","TXTMD"
"0000000010124","SWEEPER STREET: 72" MELROE"
"0000000010124","SWEEPER STREET: 72"" MELROE"
"0000000010000","ILLUMI INFR AN/PEQ-2A"
"0000000010000","TELEVISION RECEIVER HOME TYPE, 32" FLAT"
"0000000010000","TELEVISION RECEIVER HOME TYPE, 32"" FLAT"
"0000000010000","COMPUTER, PERSONAL WORKSTATION"
I've tried saving my CSV file using ANSI and UTF-8 encoding.
Here's my output using the above COPY command:
DF_EQUIPMT TXTMD
0000000010000 ILLUMI INFR AN/PEQ-2A
0000000010000 COMPUTER, PERSONAL WORKSTATION
Desired output:
DF_EQUIPMT TXTMD
0000000010124 SWEEPER STREET: 72" MELROE
0000000010124 SWEEPER STREET: 72"" MELROE
0000000010000 ILLUMI INFR AN/PEQ-2A
0000000010000 TELEVISION RECEIVER HOME TYPE, 32" FLAT
0000000010000 TELEVISION RECEIVER HOME TYPE, 32"" FLAT
0000000010000 COMPUTER, PERSONAL WORKSTATION
Any help would be greatly appreciated.
Thanks!
Trevor
Comments
Have you tried filler columns in copy statement ?
VMART=> create table test(c1 varchar(100), c2 varchar(100));
CREATE TABLE
VMART=> ! vi /tmp/test.dat
VMART=> COPY test from '/tmp/test.dat' DELIMITER ',';
Rows Loaded
(1 row)
VMART=> select * from test;
c1 | c2
-----------------+------------------------------
"0000000010124" | "SWEEPER STREET: 72" MELROE"
(1 row)
VMART=> COPY test(x filler varchar(100), y filler varchar(100), c1 as replace(x,'"','') , c2 as replace(y,'"','')) from '/tmp/test.dat' DELIMITER ',';
Rows Loaded
(1 row)
VMART=> select * from test;
c1 | c2
-----------------+------------------------------
"0000000010124" | "SWEEPER STREET: 72" MELROE"
0000000010124 | SWEEPER STREET: 72 MELROE
(2 rows)
Hi Trevor,
Double quotes in the data is an interesting case. The general answer is that you need to escape your double-quote characters in your data. Then you can use the FCSVParser parser function telling it what your escape character is.
(Also, note that there is no real CSV standard. There is the RFC4180 specification which does a CSV format, but different applications tend to accept variations of that CSV format. In Vertica, we support the RFC4180 type. But we also add another type that we call TRADITIONAL.)
In the following example, the TRADITIONAL FCSVParser and a backslash ('\' ) as a escape character.
CREATE TABLE DF_EQUIPMT_TEXT(DF_EQUIPMT varchar, TXTMD varchar);
-- Take the data from stdin so I can inline the data, but you change it to a file source
-- NB: You escape every " with the escaping character ('\')
COPY DF_EQUIPMT_TEXT FROM STDIN PARSER FCSVParser(type='traditional', escape='\');
"DF_EQUIPMT","TXTMD"`
"0000000010124","SWEEPER STREET: 72\" MELROE"
"0000000010124","SWEEPER STREET: 72\"\" MELROE"
"0000000010000","ILLUMI INFR AN/PEQ-2A"
"0000000010000","TELEVISION RECEIVER HOME TYPE, 32\" FLAT"
"0000000010000","TELEVISION RECEIVER HOME TYPE, 32\"\" FLAT"
"0000000010000","COMPUTER, PERSONAL WORKSTATION"
.
SELECT * from DF_EQUIPMT_TEXT;
DF_EQUIPMT | TXTMD ---------------+------------------------------------------
0000000010124 | SWEEPER STREET: 72" MELROE
0000000010124 | SWEEPER STREET: 72"" MELROE
0000000010000 | ILLUMI INFR AN/PEQ-2A
0000000010000 | TELEVISION RECEIVER HOME TYPE, 32" FLAT
0000000010000 | TELEVISION RECEIVER HOME TYPE, 32"" FLAT
0000000010000 | COMPUTER, PERSONAL WORKSTATION
(6 rows)
You could also use the RFC4180 type parser. But the default escape character is a double-quote character; you cannot choose another character (from that perspective, the TRADITIONAL type is more flexible). You're data should then include double-quote twice (i.e. "") for each double-quote character in your data.
Hope this helps.
@skamat - I have not tried filler columns in the COPY statement. I will try this.
@Ariel_Cary - I will try this as well. I have been combining XLSX documents into a CSV via pandas and I can precede ever double quote with the escape character ('\') and use the FCSVParser.
Thanks so much for your responses. I will try these both when our Vertica database is back up and running. I very much appreciate the help and have not tried these two options yet.
@Ariel_Cary
I'm on Vertica Analytic Database v8.0.1-0. My error says that the function fcsvparser() does not exist when I run the COPY command below. Any suggestions? I feel like this solution should work for my CSV file. I've replaced all occurrences of double quotes with '\"'
Error
12:19:08 [COPY - 0 rows, 0.012 secs] [Code: 3457, SQL State: 42883] [Vertica]VJDBC ERROR: Function fcsvparser() does not exist, or permission is denied for fcsvparser()
... 1 statement(s) executed, 0 rows affected, exec/fetch time: 0.012/0.000 sec [0 successful, 1 errors]
You should have the FCSVParser function in your database. It has been shipped with the server since Vertica 7.x. Could you try qualifying the name with the public schema?
public.FCsvParser
Also, try this query to check the parser function is installed.
SELECT * FROM user_functions WHERE function_name ILIKE 'FCsvParser';
@Ariel_Cary
This works! Thanks so much!
I was using pandas to process my csv files and for some reason the following code replaces each occurrence of double quotes with a backslash followed by TWO sets of double quotes. That's another non-Vertica related problem for me to solve, but currently have a work around.
frame['TXTMD'] = frame['TXTMD'].str.replace('"',r'\"')
@skamat
Neither of the outputs you suggested below are desired. I'm looking for an output that excludes the double quotes that contains each, field, but include the double quotes within the field. So replacing all double quotes with nothing is not desired. I have not tried filler columns in a COPY statement yet, but thanks for showing me that functionality!
VMART=> select * from test;
c1 | c2
-----------------+------------------------------
"0000000010124" | "SWEEPER STREET: 72" MELROE"
0000000010124 | SWEEPER STREET: 72 MELROE
This would be my desired output for the first record in my csv:
VMART=> select * from test;
c1 | c2
-----------------+------------------------------
0000000010124 | SWEEPER STREET: 72" MELROE
I agree and like solution from Ariel better.
Filler columns are very useful when doing transformations during loads. With filler column you can achieve what you want by trimming first and last char.
@skamat
Good to know about filler columns. Thanks for your help!