How to load US7_ASCII strings with FAVROPARSER
I am loading "many" TB of data daily.
It appears that the AVRO file format, used as an intermediary for Vertica loading, would be quite an efficient way to load very large data volumes.
I can see that FAVROPARSER has documented support for all Vertica data types.
What about loading strings that are US7_ASCII? AVRO by definition stores strings as Unicode.
The problem with Unicode strings is that, in a two-byte encoding, they are at least twice the size of US7_ASCII strings.
My ballpark estimate is that around 30-40% of my data is strings or fixed-length chars.
If I switch to AVRO as the intermediate file format for Vertica loading, I will have an additional daily data size increase in the range of a few TB (uncompressed), not to mention the conversion from ASCII chars to Unicode and back during AVRO binary encoding/decoding.
That is to say, the AVRO format would need to be at least on par with, or better than, my currently used loading format, even with the Unicode string data size increase.
I would be glad if you could post here how to load US7_ASCII strings through the AVRO format with FAVROPARSER (into a columnar table).
It would be extremely nice if FAVROPARSER added support for a logical data type annotation "US7_ASCII" on top of the bytes and fixed types. That would arguably make AVRO the most efficient intermediary for Vertica data loading.
With AVRO Unicode strings as they are, I am scratching my head over whether it is worth the effort.
Answers
In general, US7_ASCII is a subset of UTF-8, and Vertica supports UTF-8. Note also that AVRO encodes strings as UTF-8 bytes, so pure ASCII text occupies exactly one byte per character; there is no built-in size doubling.
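As a minimal sketch (table, column, and file names are hypothetical), ASCII string data in an AVRO file loads through FAVROPARSER like any other string data:

```sql
-- Hypothetical target table; VARCHAR columns store UTF-8,
-- and pure ASCII values occupy one byte per character.
CREATE TABLE events (
    id      INT,
    label   VARCHAR(64),
    payload VARCHAR(1024)
);

-- FAVROPARSER matches AVRO record field names to column names.
COPY events FROM '/data/events.avro' PARSER FAVROPARSER();
```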
Some pros and cons of AVRO to consider:
The AVRO format is good if you need the ability for the file structure to change, but it is not columnar, is not optimized for heavy reads, has poor compression efficiency, and splits poorly for parallel processing.
It is slow for data load: an AVRO file can take 30 times or more longer to load than an equivalent CSV file.
It does not have the best compression: an AVRO file on disk can take almost twice the space of the equivalent Parquet file.
Big thanks for the comment (I really mean it!).
Interesting fact about the load speed of AVRO files. I would have expected it to be faster than CSV, since binary encoding should be very efficient: no data conversion required. It seems the AVRO loader does need to convert from the AVRO binary format into a format usable by the Vertica API setters; the most obvious conversion is from UTF-8 to US7_ASCII for the string setters. Another reason the AVRO parser can be slow is inefficient mapping from columns in the AVRO file to columns in the parser output. Support for arbitrary record structures with nested substructures also does not help speed.
I got the impression that I can create my own custom format to serve as an intermediate format for Vertica loading. The format would be very narrow in scope and would serve only for the most efficient possible Vertica loading. The primary goal would be the fastest possible Vertica parser, for all Vertica-supported data types. The secondary goal would be the fastest possible file-generation API for application use. The tertiary goal would be the smallest possible size. Of course, it would be a large-block format, allowing parallel processing of each block by several parsers on several nodes in one COPY command, without the need for apportioning or a similar technique. Blocks would be compressed.
The supergoal would be to beat any other data format on speed along the whole chain:
application file generation -> file stored on disk -> file read from disk and sent to Vertica -> UDx parser parses the format.
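On the Vertica side, the last step of that chain would be the standard UDx plumbing; a rough sketch, with all library, parser, file, and table names being hypothetical:

```sql
-- Register the compiled UDx library and the custom parser
-- (FastBlockLib / FastBlockParserFactory are made-up names).
CREATE LIBRARY FastBlockLib AS '/opt/udx/FastBlockParser.so';

CREATE PARSER FastBlockParser
    AS LANGUAGE 'C++' NAME 'FastBlockParserFactory' LIBRARY FastBlockLib;

-- One COPY over a set of block files; ON ANY NODE lets Vertica
-- distribute the files among nodes for parallel parsing.
COPY events FROM '/data/batch_*.fbf' ON ANY NODE PARSER FastBlockParser();
```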
I have written "quite a lot" of UDx C++, and I clearly understand what I am talking about. Hopefully I will show you the result at the next Vertica conference.