COPY issue loading enclosed data with NULLs

I am trying to load file, which columns are enclosed by double quote ("), and delimiter is a pipe (|). Null is empty string Target table TEST.ATTRIB created without NOT NULL constraints, allowing to have NULL in any column. Example of the file I am loading: "27713"|"2"|"KCWAG_PACKSIZE"|"KCWAG_PACKSIZE"|"KCWAG_PACKSIZE"|""|"1" "27814"|"2"|"PROMOTED_GROUPING"|"PromotedGroupin"|"Promoted_Grouping"|"PROMOTED_GROUPING_KEY"|"" The first row has NULL (empty string) in column 6 Second row has NULL value in the last column, column 7 This is a command I used to load the data: copy TEST.ATTRIB from LOCAL '/home/oleg/ATTRIB.txt' enclosed by '"' delimiter '|' NULL '' REJECTED DATA '/tmp/oleg/ATTRIB_rejected.txt' EXCEPTIONS '/tmp/oleg/ATTRIB_exceptions.txt' direct; From above rows, COPY loaded only 1st row. Second failed with error: COPY: Input record 2 has been rejected (Invalid integer format '' for column 7 (DEF_ATTRIB_DATA_TYPE_KEY)). Please see /tmp/oleg/ATTRIB_rejected.txt, record 1 for the rejected record. This record was read from ATTRIB.txt But, COPY loaded 1st row correctly, accepting column 6 as NULL value. And COPY failed to load second row having NULL at the last column. Please do not suggest not to use "enclosed by" as we have many data to load where we have to use enclosure option. Please advise.

Comments

  • Hello Oleg: Your table definition shouldn't specify NOT NULL. That constraint means the column value cannot be null. The opposite (let the column value be null) is the default, so you do not need to specify it. I created a table based loosely on your input: VMart=> create table copynull ( VMart(> item_key varchar, VMart(> rating varchar, VMart(> packsize varchar, VMart(> packsize2 varchar, VMart(> packsize3 varchar, VMart(> grouping varchar, VMart(> popularity varchar) VMart-> order by item_key; CREATE TABLE Then a dummy file (copynull.sql) with similar content to what you listed: "27713"|"2"|"KCWAG_PACKSIZE"|"KCWAG_PACKSIZE"|"KCWAG_PACKSIZE"|""|"1" "27814"|"2"|"PROMOTED_GROUPING"|"PromotedGroupin"|"Promoted_Grouping"|"PROMOTED_GROUPING_KEY"|"" Last, I used COPY LOCAL as follows: VMart=> copy copynull from local '/home/dbadmin/temp/copynull.sql' enclosed by '' delimiter '|' NULL '' rejected data '/home/dbadmin/temp/copynull_rejected.txt' exceptions '/home/dbadmin/temp/copynull_exceptions.txt' direct; -[ RECORD 1 ]-- Rows Loaded | 2 This appears to have loaded correctly, so let me know if this is what you wanted: VMart=> select * from copynull; -[ RECORD 1 ]----------------------- item_key | "27713" rating | "2" packsize | "KCWAG_PACKSIZE" packsize2 | "KCWAG_PACKSIZE" packsize3 | "KCWAG_PACKSIZE" grouping | "" popularity | "1" -[ RECORD 2 ]----------------------- item_key | "27814" rating | "2" packsize | "PROMOTED_GROUPING" packsize2 | "PromotedGroupin" packsize3 | "Promoted_Grouping" grouping | "PROMOTED_GROUPING_KEY" popularity | "" Thanks. Kanti
  • Hi Kanti, Thanks for replay. You DID NOT use enclosed by '"' - enclosure by double quote, as I did, in your COPY command. That's why you loaded 2 rows, all with double quotes loaded directly to columns, which is wrong. Also, I specifically mentioned I did not use NOT NULL, actually: "Target table TEST.ATTRIB created without NOT NULL constraints, allowing to have NULL in any column." And this is my exact test which shows I stil lhave an issue: 1) $ cat ATTRIB.txt "27713"|"2"|"KCWAG_PACKSIZE"|"KCWAG_PACKSIZE"|"KCWAG_PACKSIZE"|""|"1" "27814"|"2"|"PROMOTED_GROUPING"|"PromotedGroupin"|"Promoted_Grouping"|"PROMOTED_GROUPING_KEY"|"" 2) Table create table tmp.oleg_test ( col1 int, col2 int, col3 varchar(100), col4 varchar(100), col5 varchar(100), col6 varchar(100), col7 int); 3) Load dbadmin=> copy tmp.oleg_test from LOCAL 'ATTRIB.txt' enclosed by '"' delimiter '|' NULL '' REJECTED DATA 'ATTRIB_rejected.txt' EXCEPTIONS 'ATTRIB_exceptions.txt' direct dbadmin-> ; Rows Loaded ------------- 1 (1 row) Still have one row only in target ~ Oleg
  • My apologies for that Oleg...it's been a long week.
  • Yep.. mine too :) I created a ticket with Vertica, let's see what they can suggest. Thank you for taking your time. ~ Oleg
  • Hi Oleg, the NULL specification is evaluated before quoting; the mechanism intentionally differentiates between '""' (ie., the empty string) and '' (ie., NULL). You can explicitly tell Vertica to replace empty-string values with NULLs; for example, COPY tmp.oleg_test ( col1, col2, col3_filler FILLER VARCHAR(100), col3 AS CASE WHEN col3_filler = '' THEN NULL ELSE col3_filler end, col4, col5, col6, col7 ) from LOCAL 'ATTRIB.txt' enclosed by '"' delimiter '|' NULL '' REJECTED DATA 'ATTRIB_rejected.txt' EXCEPTIONS 'ATTRIB_exceptions.txt' direct; will set the third column to NULL if its value is the empty string. You can replicate the logic for any other columns that you care about, and/or modify if you want to perform other actions. Unfortunately, I don't believe there's any less-verbose solution to this scenario... The regular COPY arguments are intended to handle the cases that Vertica encounters most often; we provide tools like COPY expressions to let users provide custom logic to handle additional cases.
  • Thanks Adam, Well, I think this is a lack of COPY functionality, IMO - all other DBs' "loaders" I worked with, handles such cases without extra efforts.. I do not think the given scenario is a rare, actually. Appreciate your help, will try this approach. ~ Oleg
  • Well, that's what the folks who asked for the current implementation said about the way it was before :-) We handle some files better than any given other loader, and it will handle some formats better than us. (If your data were comma-separated but otherwise the same, ie., valid IETF RFC-4180 CSV format, the solution would actually be much simpler. If you're curious, you might want to take a look at our parsing plugins, installed to /opt/vertica/sdk/examples/Parser* .) Regardless, your point is noted; we can always do better. If you'd be willing to include some details with your support ticket about what tools are generating your particular data, etc., that would help folks on this end know what formats are important to look at next. Also, let us know if the approach above works for you.
  • Adam, This approach works. But it makes things much more complex - generating load scripts on fly as I did, became much more complex now - I need to get all column names, find data types, NULL/NOT NULL, construct COPY with FILLERS to all nullable columns, etc.. Comma separated will not work, if you have columns with text having commas - we do have such. How do you handle if a field has CR/LF? will be such row ignored also? I can use any delimiter, beside pipe. Not sure this will help actually. And my understanding, the ticket will not be solved soon, so I have to use your suggestion and implement logic I described.. sigh.. Thank you, Oleg
  • Are all of your fields quoted? From your sample it looks like they are. Then it's fine to have a comma in your data; we pull out everything within the quotes. If you can use commas, you might want to try out the Rfc4180CsvParser example, see if it works correctly for you. It unfortunately does require some work to get it set up the first time -- it ships as source code (so that developers can modify it if they want), so you have to copy the "examples" directory to your home directory and run "make" from inside of it to compile it. Then load it by running CREATE LIBRARY and CREATE PARSER statements (examples in ParserFunctions.sql). Once you have it working, you just do (for example) copy t from '/path/to/datafile' with parser LibCSVParser(); If it does in fact work with your file data, you won't need to type out any of these copy expressions; you won't even need to specify the enclosed-by, delimiter, or null fields globally.
  • Thanks Adam, I will give it a try (Rfc4180CsvParser.cpp)
  • Isn't this a bug? The enclosed empty string became NULL when it wasn't in the last column, but didn't become NULL when it was in the last column? I noticed behavior like this recently without using enclosed by -- I was using NULL to represent an actual null value within my csv file, and if NULL was in the last column, it didn't become an actual null value. I reproduced it in 6.1, I just haven't gotten around to opening a support case for it. --Sharon
  • Hi Sharon: That issue with NULLs and the last column does sound like a bug. If you can reproduce, please do file a support case. (I'm not immediately able to reproduce it from your description -- maybe I'm not quite understanding the issue?) Oleg's issue is, I believe, unrelated and independent of column (?): Vertica's default delimited parser is built to be able to handle file formats where all literal strings are valid (so there's no constant that you can pick out to represent NULL), where quoting is instead used to specify whether a value is to be interpreted literally or not. For example, "Here's a column"|"Here's another column"|"Maybe I want a literal"|"NULL"|"as a column value, but"|"some columns just aren't"|NULL|"specified." where only the second-to-last column is to be NULL. So it's a feature. Is it also a bug? Well, maybe. How frustrating is it for the average Vertica user? Vertica is not another "one-size-fits-all" database (that's not where the industry's going); we're solving a few problems really really well and expanding from there. (For example, Oleg's original description of the file format sounded suspiciously like the "OpenOffice Calc" dialect of CSV. Calc files can be on the small side for Vertica; not to say we don't see them, but the dialect is more verbose than what customers have reported to us that they typically see.) That's why I asked Oleg (and would ask anyone else reading this thread) to tell your Support and Sales reps what you do with Vertica, and what you want to do with Vertica.
  • Incidentally, since I seem to have an audience: If you're a software engineer and you don't like our COPY, you can change it :-) We've posted the source code to several Vertica parsing engines (some installed to /opt/vertica/sdk/examples/, others on http://github.com/vertica/ ); tweaking things like their NULL-handling would be pretty straightforward. If you do this and you post a new cool parser that people use, I can guarantee you the appreciation and attention of the Vertica COPY folks :-) Of course, this is not something that we would ask of ordinary Vertica users. Only if you're also a developer and you're interested.
  • Well, Adam, I will consider this Vertica behavior as a bug, actually. Of course there is a way around, rebuilding your parser, using fillers, etc.. If Vertica provides an option of what NULL is in COPY command (default NULL '' ), it should use that symbol as NULL. But is does not. How Vertica is going to deal with a file which has multiple lines in one cell? It will nto deal. Oracle's sqlloader is much more flexible, to compare. I have a case with Vertica re: this. I do not have much time to review and modify parser. If my manager approves, I will do. Regards, Oleg
  • Oleg - you can load a file which has multiple lines in a cell by specifying a custom RECORD TERMINATOR. Adam - Thanks for the reply. The part of Oleg's problem description that sounds similar to what I'd observed was this: " The first row has NULL (empty string) in column 6 Second row has NULL value in the last column, column 7 . . . From above rows, COPY loaded only 1st row. Second failed with error: COPY: Input record 2 has been rejected (Invalid integer format '' for column 7 " The fact that it worked in column 6 but not in the last column. I believe my scenario where I saw the behavior that I described above was with a non default RECORD TERMINATOR. I believe it didn't reproduce with the default. If you create a reproducer, please report back here so I don't duplicate efforts. Otherwise I'll try to circle back to this in a week or two. --Sharon
  • Sharon: Hm... Well, the empty string is a valid VARCHAR (column 6) and not a valid INTEGER (column 7). So that outcome doesn't surprise me -- since everything's quoted so NULLs effectively aren't being parsed here, the first row gets an empty string, and the second row gets a parse error. I can reproduce that, but I'm less sure it's a bug -- it's just a difference between data types. NULL-handling aside, as discussed above. Oleg: If you're able to use an external tool like sqlloader, rather than something built into the database server like the SQL COPY statement, there are many existing tools out there that can do that for you. I'm not the right person to tell you about them, though. If you'd like to go in that direction, you could post another question asking what ETL tools people use. (Or people might pipe in here, but it's a long thread; folks may not see it.) It's true that none of these are currently provided by Vertica. Though some are free, and are quite powerful. Though now I feel like I've given you entirely too much random information and not nearly enough solution for your problem...
  • Also, Oleg, thanks for your feedback on COPY. I can't make any promises about future plans at this point. (If you want official commitments, I'm afraid you'll have to go through Support.) But your concern and your problems have been heard.
  • Thanks Adam, I am using COPY with named pipe, and it worked really fast to unload data from Oracle and load into Vertica. Besides the issues I mentioned, I am OK with that. Thanks, Oleg
  • Hi Adam, what would you suggest for the following situation, my old MySQL table dump has NULL entries for a TIMESTAMP column :P however in my new table in Vertica the column is TIMESTAMP NOT NULL, so, COPY is returning the following error: COPY: Input record 4 has been rejected (Invalid timestamp format 'NULL' for column 5 (joindate).Invalid input syntax for timestamp: "NULL") tkx
  • Hey Wils, so in this case you've changed the schema so you'll have to do some amount of data cleaning to get the data into the form that you'd like. What would you like to have happen in this case? If you want to replace NULL records with default values, you can do that with the technique above. If you want to get rid of NULL records, unfortunately there's no way to reject individual records from within a COPY expression right now; the easiest solution might be to create a temp table that uses regular TIMESTAMP, then load into that table, then INSERT..SELECT from that table into your real table with a predicate like "myTimestamp IS NOT NULL".
  • Hey Adam, thank you, I will keep this in mind in my journey =D however I have fixed my issue using: NULL as 'NULL' in the COPY command. tkx
  • Hi, Oleg.

     

    Has this ticket been resolved by vertica team?

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file