Unable to write Spark dataframe to Vertica

ngoyal Registered User

Hi,

I am trying to write a dataframe to a Vertica table. The HDFS scheme isn't configured correctly in the cluster, hence I am using web_hdfs_url.
I am getting this error

java.lang.Exception: S2V: FATAL ERROR for job S2V_job3219263099001614116. Job status information is available in the Vertica table test.S2V_JOB_STATUS_USER_NGOYAL. Unable to create/insert into target table test.ngoyal-tst with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx/S2V_job3219263099001614116/*.parquet] because of error: Seen WebHDFS exception:
URL: [http://hadoop-dw2-nn.smf1.com:50070/webhdfs/v1/user/ngoyal/tempx/S2V_job3219263099001614116/?user.name=ngoyal&op=LISTSTATUS]
HTTP response code: 404
Exception type: FileNotFoundException
Exception message: File /user/ngoyal/tempx/S2V_job3219263099001614116/ does not exist.

This is how I am writing dataframe

dataFrame
  .write
  .format("com.vertica.spark.datasource.DefaultSource")
  .options(connectionProperties)
  .mode(saveMode)
  .save()
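
For reference, this is roughly what the `connectionProperties` map looks like; the hostnames, credentials, and staging path below are placeholders rather than my real values, and the option names follow the Vertica 9.x Spark connector documentation:

```scala
// Sketch only: placeholder hosts/credentials. Option names are from the
// Vertica 9.x Spark connector (S2V) docs.
val connectionProperties = Map(
  "host"         -> "vertica-host.example.com",
  "db"           -> "mydb",
  "user"         -> "ngoyal",
  "password"     -> "********",
  "dbschema"     -> "test",
  "table"        -> "ngoyal-tst",
  // hdfs_url is required; web_hdfs_url is an optional supplement, not a replacement
  "hdfs_url"     -> "hdfs://hadoop-dw2-nn.smf1.com:8020/user/ngoyal/tempx",
  "web_hdfs_url" -> "webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx"
)
```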

Comments

  • Connyrt-emp1 Employee, Registered User

    Hi ngoyal,

    From your error message it seems that your WebHDFS API is not configured correctly. The parameter "hdfs_url" is required and is the recommended option; "web_hdfs_url" is optional and should be provided in addition to "hdfs_url", not instead of it.

    Have you tested that your webhdfs API is enabled and working correctly?

    Your Vertica cluster should be configured to talk to your HDFS: copy the HDFS config files core-site.xml and hdfs-site.xml to each Vertica node. More details in our official documentation: https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/libhdfs/ConfiguringAccessToHDFS.htm
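
    For example (hostnames and config paths below are illustrative, not specific to your cluster), copying the files from a Hadoop edge node to each Vertica node might look like:

    ```shell
    # Illustrative only: adjust hostnames and config directories for your cluster.
    for node in vertica01 vertica02 vertica03; do
      scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
          "$node":/etc/hadoop/conf/
    done
    ```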

    WebHDFS also needs to be enabled in the HDFS configuration file hdfs-site.xml. Before making changes, stop the Hadoop services; then make the change and replicate the file to the other Hadoop nodes and to the Vertica nodes, for example:

        <property>
           <name>dfs.webhdfs.enabled</name>
           <value>true</value>
        </property>
    

    Bring services back up again.

    You can test that the WebHDFS REST API is working by invoking curl.
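
    For example, the same LISTSTATUS call that appears in your error message can be issued by hand (host, port, path, and user below are copied from your logs):

    ```shell
    # WebHDFS REST call mirroring the failing request from the error message.
    # A healthy endpoint answers HTTP 200 with a JSON FileStatuses payload;
    # a 404 here means the path (or WebHDFS itself) is not reachable.
    curl -i "http://hadoop-dw2-nn.smf1.com:50070/webhdfs/v1/user/ngoyal?op=LISTSTATUS&user.name=ngoyal"
    ```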

  • ngoyal Registered User

    I did set up web_hdfs_url. These are the logs of a failed job:

    18/10/24 02:00:08 INFO s2v.S2V: verify_hadoop_conf_dir() validation Failed
    18/10/24 02:00:08 INFO s2v.S2V: Check if web_hdfs_url is provided ....
    18/10/24 02:00:08 INFO s2v.S2V: web_hdfs_url is provided .....
    18/10/24 02:00:09 INFO s2v.S2V: COPYing data into Vertica via statement: COPY "test"."ngoyal-tst" FROM 'webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx/S2V_job7762645538001378003/*.parquet' ON ANY NODE parquet DIRECT REJECTED DATA AS table "ngoyal-tst_S2V_job7762645538001378003_REJECTS" NO COMMIT
    18/10/24 02:00:10 INFO s2v.S2VUtils: Attempting to delete HDFS path: hdfs://hadoop-dw2-nn.smf1.com/user/ngoyal/tempx/S2V_job7762645538001378003
    18/10/24 02:00:10 INFO s2v.S2VUtils: Successfully deleted HDFS path: hdfs://hadoop-dw2-nn.smf1.com/user/ngoyal/tempx/S2V_job7762645538001378003
    18/10/24 02:00:10 ERROR s2v.S2V: Failed to save DataFrame to Vertica table: test.ngoyal-tst
    18/10/24 02:00:10 INFO s2v.S2V: Failure: Additional Job status info is available in Vertica table: test.S2V_JOB_STATUS_USER_NGOYAL
    18/10/24 02:00:10 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.Exception: S2V: FATAL ERROR for job S2V_job7762645538001378003. Job status information is available in the Vertica table test.S2V_JOB_STATUS_USER_NGOYAL. Unable to create/insert into target table test.ngoyal-tst with SaveMode: Append. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx/S2V_job7762645538001378003/*.parquet] because of error: Seen WebHDFS exception:
    URL: [http://hadoop-dw2-nn.smf1.com:50070/webhdfs/v1/user/ngoyal/tempx/S2V_job7762645538001378003/?user.name=ngoyal&op=LISTSTATUS]
    HTTP response code: 404
    Exception type: FileNotFoundException
    Exception message: File /user/ngoyal/tempx/S2V_job7762645538001378003/ does not exist.

    It seems like the job itself deletes the directory and then tries to access it, which is weird.

  • ngoyal Registered User

    Seems like WebHDFS isn't listing files in the directory. Thanks for the help.

  • Connyrt-emp1 Employee, Registered User
    edited October 24

    The COPY statement will fail if there are no files in HDFS. Have you tried writing your DF to HDFS directly from Spark? For example:

    mydf.write.parquet(webhdfs_path)

    Does it work?

    Also, I am curious: is there a particular reason you need to use Parquet as opposed to ORC? ORC is the default and recommended option.

    Can you capture the log again, this time with debug logging enabled (sc.setLogLevel("DEBUG")), and post it?

    The step that deletes the HDFS files is correct; it simply makes sure the folder is empty before attempting to write to HDFS.
