Vertica connector for Spark: loading from a Kerberos-enabled Hadoop cluster to a non-Kerberos Vertica cluster
I am experimenting with the Spark-Vertica connector, trying to load data from a Kerberos-enabled Hadoop cluster into a non-Kerberos Vertica cluster. I am unable to set up the HDFS scheme because the Vertica cluster does not have Kerberos enabled. I am currently experimenting with loading through the Vertica copy stream option over the JDBC driver (see the sketch at the end of this post), but I would assume the connector has better throughput. My questions are below:
1. Can the Spark-Vertica connector be used to send data from a Kerberos-enabled Hadoop cluster to a non-Kerberos Vertica cluster?
2. What is the fastest method to load data into Vertica from an external system? (Spark-Vertica connector, Vertica copy stream over JDBC, JDBC bulk loading, streaming from Kafka to Vertica)
Appreciate all the input.
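For context, this is roughly what my copy stream experiment looks like. It is only a minimal Scala sketch: the table, file path, and connection details are placeholders, and the VerticaCopyStream calls follow the examples in the Vertica JDBC documentation, so please check them against your driver version.

import java.io.FileInputStream
import java.sql.DriverManager
import com.vertica.jdbc.{VerticaConnection, VerticaCopyStream}

// Placeholder connection details for the non-Kerberos Vertica cluster
val conn = DriverManager
  .getConnection("jdbc:vertica://vertica-host:5433/mydb", "dbadmin", "password")
  .asInstanceOf[VerticaConnection]
conn.setAutoCommit(false)

// COPY ... FROM STDIN so the rows are streamed over this single JDBC connection
val copySql = "COPY public.target_table FROM STDIN DELIMITER ',' DIRECT"
val stream = new VerticaCopyStream(conn, copySql)
stream.start()
stream.addStream(new FileInputStream("/tmp/extract.csv")) // file exported from the Hadoop cluster
stream.execute()
val rowsLoaded = stream.finish() // number of rows accepted by the COPY
println(s"Rows loaded: $rowsLoaded")

conn.commit()
conn.close()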
Comments
MSQ, I am not sure we have tested this particular scenario, but I would assume there is no problem: you can use the Spark connector to move data from Spark into Vertica with a Kerberized HDFS cluster as intermediate storage. COPY bulk load is the most efficient way. Vertica copy stream uses COPY LOCAL, and COPY LOCAL only loads data through the initiator node, while a bulk COPY runs on all nodes. The Spark connector uses bulk COPY under the hood, but it is a two-step process:
1. It first moves the in-memory data from Spark into HDFS.
2. It then uses a bulk COPY to move the data from HDFS (ORC/Parquet) into Vertica.
If your data is already in HDFS, you can skip the two-step process and use a bulk COPY directly to move it from HDFS into Vertica.
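As an illustration, a save through the connector looks roughly like the sketch below. The option names follow the 9.x connector documentation and may differ in other connector versions; the hosts, paths, and credentials are placeholders. The hdfs_url/web_hdfs_url options point at the Kerberized HDFS staging directory used in step 1, and host/db/user/password point at the Vertica cluster used in step 2.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("vertica-load").getOrCreate()
val df = spark.read.parquet("hdfs://namenode:8020/data/source") // placeholder source data

// Connector options (names as in the 9.x connector docs; adjust for your version)
val opts = Map(
  "table"        -> "target_table",
  "dbschema"     -> "public",
  "db"           -> "mydb",
  "user"         -> "dbadmin",
  "password"     -> "password",
  "host"         -> "vertica-host",
  "hdfs_url"     -> "hdfs://namenode:8020/tmp/spark-vertica-staging",
  "web_hdfs_url" -> "webhdfs://namenode:50070/tmp/spark-vertica-staging"
)

df.write
  .format("com.vertica.spark.datasource.DefaultSource")
  .options(opts)
  .mode(SaveMode.Append)
  .save()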
Thank you nrodriguez for the response. I was under the assumption that Vertica bulk COPY and the Spark-Vertica connector require an HDFS scheme to be set up so that Vertica can load files from HDFS; that is, I need to specify the HDFS URL/WebHDFS URL when I execute df.write().save(). But, as I understand it, Vertica needs to have Kerberos enabled for this. Is there a way to bulk load using COPY without Kerberos on Vertica while HDFS has Kerberos?
Also, to add: bulk COPY requires the data to be local to Vertica, and if the data is on HDFS it still requires an HDFS scheme to be set up. It's the same issue as above.
Yes, we recently added Kerberos authentication via HDFS delegation tokens, which does not require Vertica to be Kerberized.
https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/Kerberos/DT.htm?Highlight=delegation token
Otherwise, Vertica has to be Kerberized in order to handle the user's Kerberos ticket.
OK, so if I do not have Kerberized Vertica and the Vertica version is < 9.1, what will be the fastest option for loading into Vertica? Also, when a bulk COPY is issued, does the load spread across all the Vertica nodes using multiple JDBC connections per COPY command, or does it use only one connection while Vertica distributes the data internally?
Can you give more information? Where is your data (HDFS)? What is the format of your files (ORC/Parquet)? The right choice will depend on the format.
The ORC and Parquet parsers are really good for data on HDFS. Here is the information about the ORC/Parquet parsers in the official docs: https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/COPY/Parameters.htm?Highlight=orc
This is an example of a COPY statement using the ORC parser:
COPY public.target_table FROM 'hdfs://:8020/tmp/job1672401524473573874/*.orc' ON ANY NODE ORC DIRECT REJECTED DATA AS TABLE TEST_REJECT_DATA_TABLE;
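If you want to drive the same pattern from Spark yourself, step 1 writes the DataFrame as ORC files to HDFS and step 2 issues that COPY over JDBC. Below is a minimal sketch with placeholder hosts, paths, and credentials; note that step 2 still requires Vertica to be able to read from the Kerberized HDFS (either a Kerberized Vertica or the delegation token support mentioned above).

import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-step-load").getOrCreate()
val df = spark.read.parquet("hdfs://namenode:8020/data/source") // placeholder source data

// Step 1: stage the DataFrame as ORC files on the (Kerberized) HDFS cluster
val stagingDir = "hdfs://namenode:8020/tmp/job_staging"
df.write.orc(stagingDir)

// Step 2: have Vertica bulk load the staged ORC files on all nodes
val conn = DriverManager.getConnection("jdbc:vertica://vertica-host:5433/mydb", "dbadmin", "password")
val stmt = conn.createStatement()
stmt.execute(
  s"""COPY public.target_table FROM '$stagingDir/*.orc' ON ANY NODE ORC DIRECT
     |REJECTED DATA AS TABLE test_reject_data_table""".stripMargin)
stmt.close()
conn.close()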
Is there more documentation on setting up non-Kerberized Vertica to load data from Kerberized HDFS? Any details will be helpful. Also, are there any performance graphs/metrics for the Spark-Vertica connector compared to JDBC loads and Kafka loads?