How to dump data to "native" file format?

wkoffel · April 2013

I'd like to dump data via vsql to a file to load into another Vertica database. My efforts to do this via EXPORT/COPY FROM VERTICA have failed for various reasons (seems like lack of proper community edition support for it, and version difference between v5 and v6 of the two DBs) I can export data using something like: vsql -U user -w pwd -h vertica.mydomain.com -d dbname -At -c "SELECT * from schema.users" > /migrate/users_export.txt And then load it in with: cat /migrate/users_export.txt | vsql -U user -w pwd -d dbname -c "COPY schema.users FROM STDIN DELIMITER '|';" The issue I'm hitting is that newlines in certain data fields cause the COPY command to miss data. For example, if a user has a biography field with new lines, it breaks in the output file. Ideally, I'd like to send the results of the SELECT statement output to 'native varchar' format, and load them like that. Is there an option for that? And is the native format compatible between major revisions? If not, any tips on properly escaping and reloading the data via COPY?

[Deleted User] · April 2013

Hi Will, we'll look into this and find a way to remedy the problems you're running into.

wkoffel · April 2013

Thanks Danielle, really appreciate it. I'm sure there must be an easy way to move data from Vertica to Vertica, but I feel thwarted at every turn. I'll wait to hear from you, let me know if some other support channel might be more efficient.

[Deleted User] · April 2013

So, EXPORT really is the right way to do this; it will work much better than any of the alternatives. v5 to v6 definitely works / is supported. But unfortunately that feature is not currently available in the Community Edition. I assume you're migrating from v5 to v6 (rather than the other way around)? You could look into the ODBC Loader that's been uploaded to our github site (https://github.com/vertica/Vertica-Extension-Packages). It allows you to have any Vertica 6.0+ database connect to any other database (Vertica or otherwise) over ODBC and load data from it. Use with another Vertica is definitely unsupported and probably slow, but if you're using the CE, you could give it a try and see if it works for you. If you're stuck using vsql, a hacky workaround is to pass "-F '...' -R '...'" to vsql (see "vsql --help"), to set the field separator and record terminator to something other than | and \n, respectively. (And set DELIMITER and RECORD TERMINATOR accordingly when you load the data with COPY.) Then you don't have to escape newlines any more. You do have to escape whatever character you choose. But you can pick something obscure like the ASCII bell character, or any other character/byte that doesn't appear in your data. (Note that you can escape newlines with a backslash, if you're generating data on your own. vsql unfortunately doesn't know how to do this.) Regarding the NATIVE format, it does not change between major releases. It is in fact fully specified in the documentation. If you're particularly interested in NATIVE, you could in fact write a little program to generate it. I don't know offhand of Vertica having shipped such a program; customers are generally able to use one of the above solutions.

wkoffel · April 2013

By way of closure for others who see this thread. Here's how it played out for me... My data has free text fields in it, which means that all manner of characters might be included. Regular pipes and newlines were insufficient as a separator. Since EXPORT isn't an option, and I had more data than I wanted to shuffle via ODBC, I did custom separators and handled some gotchas along the way. I had servers old (Enterprise v5) and new (CE v6) 1. export_catalog() to a catalog.txt file, load that onto new server 2. Export all relevant tables to text files on old server. I used the ASCII characters GS and RS (group and record separator, octal 035 and 036) for my data. So the export line looked like this: vsql -F $'\035' -R $'\036' -U dbadmin -w pass -h vertica-old -d dbname -At -c "SELECT * from schema.post" > /migrate/post.txt 3. I moved all those files to the new server for loading. 4. I found I needed to do a few things for a clean load. I needed to escape any backslashes that were in the files, otherwise a backslash at the end of a text field would escape the group separator \035 and mess up the line. cat /migrate/post.txt | sed 's/\\/\\\\/g' | vsql -U dbadmin -w pass -h vertica-new -d dbname -c "COPY schema.post FROM STDIN DELIMITER E'\035' RECORD TERMINATOR E'\036' REJECTMAX 100 REJECTED DATA '/migrate/post_rejected.txt' EXCEPTIONS '/migrate/post_exceptions.txt';" This got me some of the way there, however I found that very large files didn't load. After many hours of churning through large files (my single biggest was about 165GB), I would see a message in the log that said something like "0 records inserted, 0 records rejected". No errors, no exceptions, but no data loaded. To handle those large files, I patched together an awk script that would split the files into smaller pieces. (in case anyone needs it: http://cl.ly/code/1n2V2z2i342C) I ran that to split files into 10MM records each, and then wrote a shell script to sequentially load them using the commands above. With all that done, there were a few exceptions (less than a dozen in my case), places where unicode in free text fields had conflicted with my carefully chosen record separators. I manually went through those (be sure to send load exceptions and rejected data somewhere so you can review later) and handled them case-by-case.

[Deleted User] · April 2013

Thanks for the update! Just for future reference, you might be interested in the "NO ESCAPE" directive to COPY. It disables backslash-escaping, which it sounds like is what you would like in this case. I'm also surprised you had trouble with files of that size. 165GB is big, but it's not all *that* big for Vertica... Is it possible that you somehow got the wrong record terminator or delimiter for that case, so that it never found a complete record? In any case, if you can learn anything else about that case, definitely feel free to let us know.

wkoffel · April 2013

Thanks for the pointer to NO ESCAPE, I definitely would've used that! Yeah, I figured big files would be okay. My only guess was that maybe they were a problem because they had only one line in them? Perhaps typical loads break on newlines. Maybe a 2GB limit on how large a single line can be? It felt like files over about 2GB needed to be split, and less were okay. This is a 64-bit machine, I didn't get a sense that there was any resource-shortage issue. But I didn't dig too deeply once I discovered that splitting the file made the load reliable, I went that direction. Took too long to debug when a single run takes hours! :-)

[Deleted User] · April 2013

Hm... Load shouldn't be newline-specific; it's looking for whatever character you specify as a record terminator and breaking on that. Though if you have a single record that's >2gb, that could well cause a problem; that's much larger than Vertica's internal max record size so it will likely error out. Oh well. Glad it works for you!

Adam_1 · August 2016

But how could I dump data in the native binary format?

Adam_1 · August 2016

By the way, you have an error in the signature for the native binary format, it should be '\377' instead of '\317' for the byte offset 7: https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AdministratorsGuide/BinaryFilesAppendix/CreatingNativeBinaryFormatFiles.htm (Text literals).

We're Moving!

Create My New Community Account Now

How to dump data to "native" file format?

Comments

Leave a Comment