Preserve Empty String when using FCSV Parser
mark_d_drake
Community Edition User ✭
If I have a CSV file with an empty string, e.g.
"XXX","","YYY"
and use the following options with COPY:
PARSER fcsvparser(type='rfc4180', header=false, trim=false) NULL ''
I get a NULL instead of an empty string for column 2. Is there a setting that will preserve the EMPTY string?
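For reference, a minimal reproduction along these lines (table and file names are illustrative):

CREATE TABLE t (c1 VARCHAR(10), c2 VARCHAR(10), c3 VARCHAR(10));

COPY t FROM LOCAL '/tmp/data.csv'
PARSER fcsvparser(type='rfc4180', header=false, trim=false) NULL '';

SELECT c1, c2 IS NULL AS c2_is_null, c3 FROM t;
-- c2_is_null comes back true, even though the file contained ""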
Answers
Maybe a FILLER?
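Something along these lines, maybe (just a sketch; the table and column names are illustrative): load column 2 into a FILLER column, then use NVL to turn the NULL the parser produces back into an empty string.

COPY t (c1,
        c2_filler FILLER VARCHAR(100),
        c2 AS NVL(c2_filler, ''),
        c3)
FROM LOCAL '/tmp/data.csv'
PARSER fcsvparser(type='rfc4180', header=false, trim=false);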
Jim
Thanks for the response, I'll give it a go. I am writing a general-purpose tool (https://github.com/markddrake/YADAMU---Yet-Another-DAta-Migration-Utility) which will generate these commands. How expensive would this be, assuming I have to apply it to all VARCHAR columns (which would be the majority of the columns loaded)? E.g., is it worth me adding a switch so that people can enable preserving zero-length strings only if it is important to them?
Also, since trim=false preserves leading and trailing whitespace, wouldn't it be logical for it to also treat "" as a zero-length string, particularly as ,, is explicitly mapped to NULL?
@mark_d_drake - That's cool that you are writing a migration utility!
About the NULL / Empty String - Are you sure that folks actually don't want NULL? In databases like Oracle, NULL and Empty String are treated as the same thing. But in Vertica, they are different.
About the Vertica COPY, note that the fcsvparser is much slower than the default delimited parser.
So, in your example, you can use the default parser like this:
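A sketch of the idea (table and file names are illustrative): let ENCLOSED BY handle the quoting, and move the NULL marker away from the empty string so that "" loads as '' instead of NULL.

COPY t FROM LOCAL '/tmp/data.csv'
DELIMITER ',' ENCLOSED BY '"' NULL E'\\N';

The trick is that NULL is set to a marker (here \N) that never occurs in the data, so a quoted empty string no longer matches the NULL string.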
Jim
AFAIK (20+ years as a PM at Oracle), Oracle is the only RDBMS that converts an empty string to NULL, for reasons that are either historical or hysterical, depending on your perspective.
As I'm sure you are aware, the answer to your question is "probably". E.g., if you put 50 customers in a room and ask the question, 20 will want one behavior, 20 the other, and 10 either won't understand the question or won't care, but they will all have a very strong opinion about what the default should be...
The reason I chose CSV is that I use the same code to generate CSV files for ingestion by different databases. Also, using CSV seemed a little more 'professional', since it is (in theory) somewhat well defined by RFC 4180. The other problem with delimited data is that, as a general-purpose tool, I have absolutely no idea what characters might exist in the source data.
Jim, will your solution differentiate between "XXX",,"YYY" and "XXX","","YYY"?
E.g., the horrible situation where some of the source columns contained NULL and others contained the EMPTY string.
No. You will have to handle that manually. Maybe with a DECODE?
Example:
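A sketch under some assumptions (table and column names are illustrative, and it assumes no embedded delimiters or escaped quotes in the values): parse without ENCLOSED BY so the quote characters arrive as part of the data, then decide what each value means.

COPY t (c1_raw FILLER VARCHAR(100), c1 AS BTRIM(c1_raw, '"'),
        c2_raw FILLER VARCHAR(100), c2 AS DECODE(c2_raw, '""', '', BTRIM(c2_raw, '"')),
        c3_raw FILLER VARCHAR(100), c3 AS BTRIM(c3_raw, '"'))
FROM LOCAL '/tmp/data.csv' DELIMITER ',';

Here an unenclosed empty field stays NULL (the DECODE default, BTRIM of NULL, is still NULL), while a literal "" becomes the empty string.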
But, keep in mind, someone might actually want to load the two quote characters as data into a table...
Jim
I need to stick with CSV for exactly that reason: I don't know what might be in the table, so I really need a solution where I can differentiate between, and preserve, both NULL and the EMPTY string should they occur in different rows of the same table. I could (although it would have a performance overhead) scan the data as I'm preparing the CSV files and split the dataset into 2 parts (rows with NULL and rows with the EMPTY string), but I don't think that really solves the problem: if I extend it to 2 columns I have 4 possible data files, with 3 columns 8, etc...