Importing data with Text Qualifiers

I am trying to import data into Vertica using the COPY command from a .txt file.

The file has field delimiter "\t", row delimiter "\n" and text qualifier "". So the data looks like this-

"Data1"       "Data2"       "Data3"        "Data1"       "Data2"       "Data3"        "Data1"       "Data2"       "Data3"        
"Data1"       "Data2"       "Data3"        "Data1"       "Data2"       "Data3"        "Data1"       "Data2"       "Data3"        "Data1"       "Data2"       "Data3"        

Notice how the data in one row spills over to the next line in the .txt file.
If I try to specify my delimiter as '"\t"', I get Error 2730: COPY delimiter must be a single character. Whats the work around to this?

I do not want to change the format of the exported data

Comments

  • You should be able to use the "E" to signify you want the specified delimiter escaped and treated as a literal. Example below for the tab field delimiter.

    ephemeraltest=> \! cat test.txt
    "data1" "data2" "data3"
    "data1a"        "data2a"        "data3"

    ephemeraltest=> copy test from '/home/dbadmin/test.txt' delimiter '\t';
    ERROR 2730:  COPY delimiter must be a single character
    ephemeraltest=> copy test from '/home/dbadmin/test.txt' delimiter E'\t';
     Rows Loaded
    -------------
               2

    I hope it helps


  • Hi Kunal,

    Several things here:

    First, strings in native standards-compliant SQL don't support backslash-escaping.  They don't need it -- all characters, including meta-characters, etc., are interpreted literally (except for the single-quote character, which is escaped by having two of them in a row).  So you probably want that string to be ['"    "'] (a literal tab; not sure how well that will render in HTML).

    Alternatively, if you want C-style backslash escape sequences, Vertica supports them via extended strings:  [E'"\t"']

    Second, what you refer to as a "qualifier" is what we call the "enclosing character".  I think you don't want [DELIMITER E'"\t"']; I think you want [DELIMITER E'\t' ENCLOSED BY'"'].

    In terms of data from one row spilling over from one line in the text file to the next -- it sounds to me like you do not in fact have a row delimiter?  You indicate a row delimiter "\n", but as you point out, I don't see any newlines in this example data.  Vertica does require a row delimiter (what we call a "record terminator"), and you must have one after each record.  This is how Vertica validates that it has correctly received a record, and recovers from invalid records.

    If you don't have a record terminator, but you have a format which would be amenable to inserting one, one option would be to compile and install the "ExternalFilter()" UDx from Vertica's github site (<http://github.com/vertica/Vertica-Extension-Packages/> -- in shell_load_package) and tell it to stream your data through "sed" and replace every third tab with a newline.  Note that, while this package is used by a number of Vertica users, it doesn't receive professional support/documentation; you have to figure it out yourself or by asking here on the forums.

    Another alternative, if you need custom parsing logic, would be to write a custom parser for your file format.  Vertica provides a "UDParser" API (C++ and Java) for extending the Vertica executable to support new parsing formats.  If you're interested in doing this, see "/opt/vertica/sdk/examples/ParserFunctions/" on any node in your Vertica cluster for some examples.

    Adam
  • Yes, but I want the \t AND the "" gone. The "" are there because my Data might have tabs in it, and I don't want that to be read as the field delimiter when its not.
  • Solved this - ENCLOSED BY feature of the COPY command

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file