Vertica connector for Spark: from a Kerberos-enabled Hadoop cluster to a non-Kerberos-enabled Vertica cluster

MSQ Registered User

I am experimenting with the Spark-Vertica connector, trying to load from a Kerberos-enabled Hadoop cluster into a non-Kerberos Vertica cluster. I am unable to set up the HDFS scheme because the Vertica cluster does not support Kerberos. I am currently experimenting with loading via the Vertica copy-stream option over the JDBC driver (roughly the approach sketched below my questions), but I would assume that the connector will have better throughput. My questions:
1. Can the Spark-Vertica connector be used to send data from a Kerberos-enabled Hadoop cluster to a non-Kerberos Vertica cluster?
2. What is the fastest method to load data into Vertica from an external system (the Spark-Vertica connector, Vertica copy stream over JDBC, JDBC bulk loading, or streaming from Kafka to Vertica)?
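
For context, this is roughly the copy-stream load I am experimenting with (a minimal sketch using the Vertica JDBC driver's VerticaCopyStream; the host, credentials, table, and file path are placeholders):

import com.vertica.jdbc.VerticaConnection;
import com.vertica.jdbc.VerticaCopyStream;

import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;

public class CopyStreamLoad {
    public static void main(String[] args) throws Exception {
        // Plain (non-Kerberos) JDBC connection to the Vertica cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica-host.example.com:5433/mydb", "dbadmin", "secret")) {
            conn.setAutoCommit(false);

            // COPY ... FROM STDIN streams all data through this one connection,
            // i.e. through the initiator node only.
            VerticaCopyStream stream = new VerticaCopyStream(
                    (VerticaConnection) conn,
                    "COPY public.target_table FROM STDIN DELIMITER ',' DIRECT");
            stream.start();

            // The input stream could just as well wrap data pulled out of Spark/HDFS.
            try (InputStream input = new FileInputStream("/tmp/export.csv")) {
                stream.addStream(input);
                stream.execute();
            }

            long rows = stream.finish();  // number of rows loaded
            conn.commit();
            System.out.println("Loaded " + rows + " rows");
        }
    }
}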

Appreciate all the input.

Comments

  • nrodriguez Registered User

    MSQ, I am not sure we have tested this particular scenario, but I would assume there is no problem: you can use the Spark connector to move data from Spark into Vertica using a Kerberized HDFS cluster as intermediate storage. A bulk COPY is the most efficient way to load. Vertica copy stream uses COPY LOCAL, and COPY LOCAL only loads data through the initiator node, while a bulk COPY runs on all nodes. The Spark connector uses bulk COPY under the hood, but it is a two-step process:

    1. It first moves the in-memory data from Spark into HDFS.
    2. It then uses a bulk COPY to move the data from HDFS (ORC/Parquet) into Vertica.

    If your data is already in HDFS, you can skip the two-step process and use a bulk COPY directly to move it from HDFS into Vertica.
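
    For reference, a connector save looks roughly like this (a sketch, assuming the Spark connector for Vertica 9.x, where the hdfs_url/web_hdfs_url options point at the intermediate staging location; all hosts, credentials, and paths are placeholders):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class SparkVerticaSave {
        // df is the in-memory DataFrame to persist into Vertica.
        static void saveToVertica(Dataset<Row> df) {
            Map<String, String> opts = new HashMap<>();
            opts.put("host", "vertica-host.example.com");
            opts.put("db", "mydb");
            opts.put("user", "dbadmin");
            opts.put("password", "secret");
            opts.put("dbschema", "public");
            opts.put("table", "target_table");
            // Step 1: the connector stages the DataFrame here as files on HDFS...
            opts.put("hdfs_url", "hdfs://namenode.example.com:8020/tmp/spark-vertica-staging");
            // ...Step 2: Vertica bulk-COPYs the staged files in via WebHDFS.
            opts.put("web_hdfs_url", "webhdfs://namenode.example.com:50070/tmp/spark-vertica-staging");

            df.write()
              .format("com.vertica.spark.datasource.DefaultSource")
              .options(opts)
              .mode(SaveMode.Append)
              .save();
        }
    }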

  • MSQ Registered User

    Thank you, nrodriguez, for the response. I was under the assumption that Vertica bulk COPY and the Spark-Vertica connector require an HDFS scheme to be set up so that Vertica can load files from HDFS; that is, I need to specify the HDFS URL/WebHDFS URL when I execute df.write().save(). But unfortunately, as per my understanding, Vertica needs to have Kerberos enabled for this. Is there a way to bulk load using COPY without Kerberos on Vertica while HDFS has Kerberos enabled?

  • MSQ Registered User

    Also, to add: bulk COPY requires the data to be local to Vertica, and if the data is on HDFS it will still require an HDFS scheme to be set up. It's a similar issue to the one above.

  • nrodriguez Registered User

    Yes, we recently added Kerberos authentication via HDFS delegation tokens, which does not require Vertica to be Kerberized.

    https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/Kerberos/DT.htm?Highlight=delegation token

    Otherwise, Vertica has to be Kerberized in order to handle the user's Kerberos ticket.
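
    As a rough illustration of the delegation-token flow over JDBC (a sketch based on the linked 9.1 docs; the HadoopImpersonationConfig JSON fields, token value, hosts, and table are placeholders, and the token itself would be fetched from the Kerberized cluster, e.g. with hdfs fetchdt):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DelegationTokenLoad {
        public static void main(String[] args) throws Exception {
            // Ordinary (non-Kerberos) connection to the Vertica cluster.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:vertica://vertica-host.example.com:5433/mydb", "dbadmin", "secret");
                 Statement stmt = conn.createStatement()) {

                // Hand the session an HDFS delegation token obtained from the
                // Kerberized Hadoop cluster; the JSON shape follows the linked docs.
                stmt.execute("ALTER SESSION SET HadoopImpersonationConfig = "
                        + "'[{\"nameservice\":\"hadoopNS\",\"token\":\"PLACEHOLDER_DELEGATION_TOKEN\"}]'");

                // Vertica can now bulk-COPY the files straight out of Kerberized HDFS
                // without being Kerberized itself.
                stmt.execute("COPY public.target_table FROM 'hdfs://hadoopNS/tmp/data/*.orc'"
                        + " ON ANY NODE ORC DIRECT");
            }
        }
    }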

  • MSQ Registered User

    OK, so if I do not have a Kerberized Vertica and my Vertica version is < 9.1, what will be the fastest option to load into Vertica? Also, when a bulk COPY is issued, is the load spread across all the Vertica nodes using multiple JDBC connections per COPY command, or does it use only one connection while Vertica distributes the data internally?

  • nrodriguez Registered User
    edited June 21

    Can you give more information? Where is your data, in HDFS? What is the format of your files, ORC/Parquet? The right choice will depend on the format.
    The ORC and Parquet parsers are really good for data on HDFS. Here is the information about the ORC/Parquet parsers in the official docs: https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/COPY/Parameters.htm?Highlight=orc

    This is an example of a COPY statement using the ORC parser. Note that with the glob plus ON ANY NODE, a single COPY statement issued over one connection distributes the file reads across all nodes:

    COPY public.target_table FROM 'hdfs://<namenode>:8020/tmp/job1672401524473573874/*.orc' ON ANY NODE ORC DIRECT REJECTED DATA AS TABLE TEST_REJECT_DATA_TABLE;
