Load Spark DataFrame to Vertica Table using Spark Vertica connector
Hi all,
I am trying to load Spark DataFrame data into a Vertica table using the Vertica connector vertica-8.1.1_spark1.6_scala2.10-20170623.jar. Here is the statement from my Java program:
df.write().format("com.vertica.spark.datasource.DefaultSource")
    .option("table", "temp01")
    .option("db", "DB")
    .option("dbschema", "SCHEMA")
    .option("user", "USER")
    .option("password", "PASS")
    .option("host", "VHOST")
    .option("hdfs_url", "hdfs://NameNode:9000/tmp/temp")
    .option("web_hdfs_url", "webhdfs://NameNode:50070/tmp/temp")
    .option("autocommit", "true")
    .mode(SaveMode.Append)
    .save();
I do not have the hdfs scheme set up for Vertica Hadoop integration, but I am trying to make use of web_hdfs_url. I get the following info message from the connector, and the program hangs after that. No data is written to the temp HDFS location in ORC format. Any suggestions would be appreciated.
INFO s2v.S2VUtils: Unique job name:S2V_job7768495893071306009 will appear in Vertica table: USER_30_DAY_TABLES.S2V_JOB_STATUS_USER_DBUSER.
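The job status table named in that message can be queried directly in Vertica to see whether the job ever registered. A minimal check, using the table name from the log above (the exact columns vary by connector version):

SELECT * FROM USER_30_DAY_TABLES.S2V_JOB_STATUS_USER_DBUSER;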
Comments
Hi MSQ
When moving data from Spark to Vertica, your Vertica cluster must be configured to talk to your HDFS cluster; enabling WebHDFS alone is not enough. The steps to configure Vertica to access your HDFS cluster are in the documentation linked at the end of this comment; a sketch follows below.
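In outline, per the linked documentation: copy core-site.xml and hdfs-site.xml from your Hadoop cluster to a directory on every Vertica node, point Vertica at that directory, and verify that each node can read them. A minimal vsql sketch; the path /etc/hadoop/conf is an assumption, substitute your own:

-- Hadoop config files are assumed to have already been copied to
-- /etc/hadoop/conf on every Vertica node.
ALTER DATABASE DB SET HadoopConfDir = '/etc/hadoop/conf';
-- Check that each node can locate and read the configuration files:
SELECT VERIFY_HADOOP_CONF_DIR();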
Your Spark program should work after that. Under the hood, the connector first stages the DataFrame as ORC files at the hdfs_url location, then loads them into Vertica with a COPY statement along these lines:

COPY "SCHEMA"."temp01" FROM 'hdfs://NameNode:9000/tmp/temp/*.orc' ON ANY NODE
ORC DIRECT
REJECTED DATA AS TABLE "test_job596614995762348712_REJECTS"
NO COMMIT;
More information is available in Vertica's official documentation: https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/libhdfs/ConfiguringAccessToHDFS.htm?Highlight=Configuring the hdfs Scheme