Answers
Vertica provides two kinds of NATIVE parsers: NATIVE (binary) and NATIVE VARCHAR. NATIVE (binary) offers better performance: because binary-format data does not require the use and processing of delimiters, it precludes the need to convert integers, dates, and timestamps from text to their native storage format, which improves load performance.
With Avro, the file being loaded must include its related schema. One advantage of the Avro format is that you can load complex types into the table.
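For reference, a minimal sketch of what the two NATIVE loads might look like (the table name and file paths here are hypothetical):

    -- Binary NATIVE file: fixed layout, no delimiter handling or text conversion
    COPY sales FROM '/data/sales.bin' NATIVE ABORT ON ERROR;

    -- NATIVE VARCHAR: same header-based file layout, but all values arrive as text
    COPY sales FROM '/data/sales.nv' NATIVE VARCHAR;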
Are you facing any performance issues?
Thanks for your reply. We're not having any issues; we're still designing/planning and just figuring out which path to take.
This is helpful.
We caught this disclaimer in the documentation:
“You can load binary data using the NATIVE parser option, except with COPY LOCAL, which does not support this option.”
Would you advise a next-best approach we can take to exploit NATIVE (binary)?
Thanks again,
Kent
@kent_hoover: Yes, you can use NATIVE binary. However, if you plan to load complex types, you will need to use other parsers as appropriate.
Even using COPY LOCAL?
Yes, you can use COPY LOCAL to load complex types. For NATIVE, as you mentioned, COPY LOCAL cannot be used.
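Since COPY LOCAL cannot use NATIVE, the usual workaround is to stage the binary file on a node (or on storage the server can reach) and run a server-side COPY. A sketch, with a hypothetical node name and path:

    -- 1. Stage the file, e.g. scp sales.bin dbadmin@node01:/data/staging/
    -- 2. Load it server-side, where NATIVE is allowed
    COPY sales FROM '/data/staging/sales.bin' ON v_vmart_node0001 NATIVE;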
This is an odd comparison. Loading a NATIVE filetype is a complex endeavor: it results in a very fast load, but the amount of work required to get data into that format can be cumbersome. I've not known many clients to embark on that path. We don't have any comparisons between NATIVE and anything else that I'm aware of.
I will say that if the question is about file types and load performance, you'll find Avro to be the slower one in a comparison with Parquet, which seems to be emerging as the dominant Hadoop-esque file type. The internal structure of Avro does not allow it to be parallelized internally, whereas Parquet can be. So you might consider benchmarks between Avro and Parquet. I think you'll be pleasantly surprised.
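If you want to run that benchmark yourself, a simple sketch in vsql (assuming the same data set exported once to each format, and a hypothetical table named bench):

    \timing
    TRUNCATE TABLE bench;
    COPY bench FROM '/data/bench.avro' PARSER FAVROPARSER();
    TRUNCATE TABLE bench;
    COPY bench FROM '/data/bench.parquet' PARQUET;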
The goal behind the question is to know whether it is worth moving to a different format, binary in this case. Intuitively we think binary (if done right) would be faster than text formats. And since Vertica now supports Avro, and Avro is a well-known format, it would be preferred over NATIVE binary, since there are already libraries in virtually every language for writing Avro.
Aside from the fact that Avro only supports STDIN starting in Vertica 11.1, I was curious whether Vertica has any benchmarks on load performance for various formats, from the text formats (DSV-like) to binary.
We support loading Avro data from files and via Kafka ingestion, in addition to STDIN. Please find the documentation link below.
https://www.vertica.com/docs/12.0.x/HTML/Content/Authoring/DataLoad/LoadAVROData.htm
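For illustration, the file and STDIN variants might look like this (the table and file names are hypothetical):

    -- From a file visible to the server
    COPY events FROM '/data/events.avro' PARSER FAVROPARSER();

    -- From the client over STDIN (supported for Avro as of 11.1)
    vsql -c "COPY events FROM STDIN PARSER FAVROPARSER();" < events.avro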
I know it is supported; the question is which one is faster. NATIVE binary, we presume (but it is proprietary). And between Avro and CSV, which is faster? We presume Avro.
Avro would be faster than CSV, since it is compact and stores data in binary format.
"Avro would be faster than CSV"
That is a questionable statement.
Avro and Vertica native formats have headers, and the load performance of formats with headers can best be described as awkward.
Vertica parsers do not support internal parallelization of data formats with headers, i.e. no apportioned or cooperative loads. Both Avro and native formats are parsed strictly in a single thread on a single node.
The CSV and JSON data formats support cooperative (i.e. multithreaded) parsing; Vertica uses 4 threads for parsing. There is no apportioned parsing, i.e. only a single node participates in the parse.
Last time I checked, CSV and JSON cooperative parsing worked only when vsql accepts data from stdin; COPY FROM LOCAL does not use cooperative parsing. That makes a load from stdin 2-3 times faster than from LOCAL. JDBC was not able to use cooperative parsing either. Disclaimer: that may have changed in recent versions.
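To make that concrete, here is roughly how the two client-side variants differ for the same CSV file (hypothetical table and file names):

    # stdin path: eligible for cooperative (multithreaded) parsing
    vsql -c "COPY t FROM STDIN DELIMITER ',';" < data.csv

    # COPY LOCAL path: single-threaded parse, typically 2-3x slower
    vsql -c "COPY t FROM LOCAL 'data.csv' DELIMITER ',';"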
Both Avro and native formats are binary, and have the advantage of not needing expensive string conversions during parsing. But both can be parsed in only a single thread on a single node.
CSV and JSON are parsed in 4 threads, but they are string formats and need expensive conversions.
So you are comparing parsing binary formats in one thread versus parsing CSV in 4 threads, in both cases on a single node.
In my experience, CSV parsing can beat the native formats if you are careful about how you generate the data. Specifically, do not put any date, time, or timestamp in string formats. Use the number of days since the epoch for dates, the number of microseconds since midnight for times, and the number of microseconds since the epoch for timestamps. With this layout, a CSV load reliably beats the native format on speed. I have not compared against Avro, but due to Avro's internal compression, parsing Avro will be slower than native.
Unfortunately, Vertica does not have date/time/timestamp input formats for days or microseconds since midnight/epoch. You would need to write a C++ UDx that takes an integer and produces the Vertica data type, run it in unfenced mode, and use it in a COPY transform (see the sketch below). It is trivial to write, but out of reach for the vast majority of users.
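The SQL side of that approach would look roughly like this; the library path, factory name, and function name are all hypothetical, and the C++ body itself is omitted:

    -- Register the compiled UDx library and an unfenced scalar function
    CREATE LIBRARY IntDateLib AS '/opt/udx/IntDateLib.so';
    CREATE FUNCTION days_to_date AS LANGUAGE 'C++'
        NAME 'DaysToDateFactory' LIBRARY IntDateLib NOT FENCED;

    -- Use it in a COPY transform: read the integer into a FILLER column,
    -- then convert it to a DATE during the load
    COPY t (d_raw FILLER INT, d AS days_to_date(d_raw))
        FROM '/data/t.csv' DELIMITER ',';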
A side advantage of avoiding date/time/timestamp string formats is much faster generation of the data files for Vertica loading; I can easily get a 4x speedup. I always count the time to generate the data files as part of the total load time.
All of the above is for loading files from a client server. If you involve Kafka or Spark, things become more interesting: when the data already resides in Kafka/Spark, the load into Vertica is lightning fast and none of the above applies.
The problem is that you need to upload your file into Kafka first, and that is essentially a single-threaded process. The total time of uploading to Kafka and then loading into Vertica will easily be slower than a straight load into Vertica using vsql or ODBC. And do not forget you need to run sizeable infrastructure to host Spark/Kafka.
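For completeness, the Vertica side of a Kafka load is a one-statement micro-batch. A sketch with hypothetical broker, topic, and table names (check the KafkaSource/KafkaAvroParser documentation for the exact parameters in your version):

    COPY events SOURCE KafkaSource(stream='events|0|-2',
                                   brokers='kafka01:9092',
                                   stop_on_eof=TRUE)
                PARSER KafkaAvroParser();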
The Vertica CSV parser is outrageously fast. It is the most-used parser, so no wonder it is highly optimized and delivers very well. I spent quite a lot of time trying to beat CSV loads with binary formats, and it was not an easy task; it required a very non-standard approach and a huge amount of coding.
What an outstanding answer. Very interesting!