Load Spark DataFrame to Vertica Table using Spark Vertica connector

Hi all,

I am trying to load a Spark data frame data to a Vertica table using Vertica connector vertica-8.1.1_spark1.6_scala2.10-20170623.jar. Here is the statement from my Java program:

    df.write().format("com.vertica.spark.datasource.DefaultSource").option("table", "temp01")
            .option("db", "DB").option("dbschema", "SCHEMA").option("user", "USER")
            .option("password", "PASS").option("host", "VHOST")
            .option("hdfs_url", "hdfs://NameNode:9000/tmp/temp")
            .option("web_hdfs_url", "webhdfs://NameNode:50070/tmp/temp").option("autocommit", "true")
            .mode(SaveMode.Append).save();

I do not have the HDFS scheme setup for Vertica Hadoop integration but I am trying to make use of web_hdfs_url. I get the following info message from the connector and the program hangs after that. I do not get any data written to the temp hdfs location in ORC format. Any suggestions will be appreciated.

INFO s2v.S2VUtils: Unique job name:S2V_job7768495893071306009 will appear in Vertica table: USER_30_DAY_TABLES.S2V_JOB_STATUS_USER_DBUSER.

Comments

  • Hi MSQ
    In the case of moving data from Spark to Vertica, it is a requirement that your Vertica Cluster is configured to talk to your HDFS cluster. Enabling WebHDFS only is not enough. Here are the steps to configure Vertica to access your HDFS cluster:

    1. Copy Hadoop config files: "core-site.xml" and "hdfs-site.xml" on each Vertica node in this location etc/hadoop/conf.
    2. If needed, change the permissions and ownership of the directory etc/hadoop/conf and config files to make sure they are accessible by your database user "USER"
    3. Execute SELECT VERIFY_HADOOP_CONF_DIR(); from VSQL on your Vertica cluster to check that the configuration is correct.

    Your spark program should work after that.

    1. Additionally you can test that the configuration is correct by reading an ORC file using Vertica's ORC reader, this is an example you can run in VSQL:

    COPY "SCHEMA"."temp01" FROM 'hdfs://NameNode:9000/tmp/temp/*.orc' ON ANY NODE orc DIRECT REJECTED DATA AS table "test_job596614995762348712_REJECTS" NO COMMIT;

    More information on Vertica's official documentation: https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/libhdfs/ConfiguringAccessToHDFS.htm?Highlight=Configuring the hdfs Scheme

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file