Unable to write Spark dataframe to Vertica

ngoyal Registered User

Hi,

I am trying to write a dataframe to a Vertica table. The HDFS scheme isn't configured correctly in the cluster, hence I am using web_hdfs_url.
I am getting this error

java.lang.Exception: S2V: FATAL ERROR for job S2V_job3219263099001614116. Job status information is available in the Vertica table test.S2V_JOB_STATUS_USER_NGOYAL. Unable to create/insert into target table test.ngoyal-tst with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx/S2V_job3219263099001614116/*.parquet] because of error: Seen WebHDFS exception:
URL: [http://hadoop-dw2-nn.smf1.com:50070/webhdfs/v1/user/ngoyal/tempx/S2V_job3219263099001614116/?user.name=ngoyal&op=LISTSTATUS]
HTTP response code: 404
Exception type: FileNotFoundException
Exception message: File /user/ngoyal/tempx/S2V_job3219263099001614116/ does not exist.

This is how I am writing dataframe

dataFrame
  .write
  .format("com.vertica.spark.datasource.DefaultSource")
  .options(connectionProperties)
  .mode(saveMode)
  .save()
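
For reference, this is roughly what the `connectionProperties` map looks like; the hostnames, credentials, and staging path below are placeholders rather than my real values, and the option names follow the Vertica 9.x Spark connector documentation:

```scala
// Sketch only: placeholder hosts/credentials. Option names are from the
// Vertica 9.x Spark connector (S2V) docs.
val connectionProperties = Map(
  "host"         -> "vertica-host.example.com",
  "db"           -> "mydb",
  "user"         -> "ngoyal",
  "password"     -> "********",
  "dbschema"     -> "test",
  "table"        -> "ngoyal-tst",
  // hdfs_url is required; web_hdfs_url is an optional supplement, not a replacement
  "hdfs_url"     -> "hdfs://hadoop-dw2-nn.smf1.com:8020/user/ngoyal/tempx",
  "web_hdfs_url" -> "webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx"
)
```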

Comments

  • Connyrt-emp1 Employee, Registered User

    Hi ngoyal,

    From your error message it seems that your WebHDFS API is not configured correctly. The parameter "hdfs_url" is required and is the recommended option; "web_hdfs_url" is optional and should be provided in addition to "hdfs_url", not instead of it.

    Have you tested that your webhdfs API is enabled and working correctly?

    Your Vertica cluster should be configured to talk to your HDFS: copy the HDFS config files core-site.xml and hdfs-site.xml to each Vertica node. More details in our official documentation: https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/libhdfs/ConfiguringAccessToHDFS.htm
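
    For example (hostnames and config paths below are illustrative, not specific to your cluster), copying the files from a Hadoop edge node to each Vertica node might look like:

    ```shell
    # Illustrative only: adjust hostnames and config directories for your cluster.
    for node in vertica01 vertica02 vertica03; do
      scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
          "$node":/etc/hadoop/conf/
    done
    ```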

    WebHDFS also needs to be enabled in the HDFS configuration file hdfs-site.xml. Before making changes, stop the Hadoop services; then make the change and replicate the file to the other Hadoop nodes and to the Vertica nodes, for example:

        <property>
           <name>dfs.webhdfs.enabled</name>
           <value>true</value>
        </property>
    

    Bring services back up again.

    You can test that the WebHDFS REST API is working by invoking curl.
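
    For example, the same LISTSTATUS call that appears in your error message can be issued by hand (host, port, path, and user below are copied from your logs):

    ```shell
    # WebHDFS REST call mirroring the failing request from the error message.
    # A healthy endpoint answers HTTP 200 with a JSON FileStatuses payload;
    # a 404 here means the path (or WebHDFS itself) is not reachable.
    curl -i "http://hadoop-dw2-nn.smf1.com:50070/webhdfs/v1/user/ngoyal?op=LISTSTATUS&user.name=ngoyal"
    ```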

  • ngoyal Registered User

    I did set up web_hdfs_url. These are the logs of a failed job:

    18/10/24 02:00:08 INFO s2v.S2V: verify_hadoop_conf_dir() validation Failed
    18/10/24 02:00:08 INFO s2v.S2V: Check if web_hdfs_url is provided ....
    18/10/24 02:00:08 INFO s2v.S2V: web_hdfs_url is provided .....
    18/10/24 02:00:09 INFO s2v.S2V: COPYing data into Vertica via statement: COPY "test"."ngoyal-tst" FROM 'webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx/S2V_job7762645538001378003/*.parquet' ON ANY NODE parquet DIRECT REJECTED DATA AS table "ngoyal-tst_S2V_job7762645538001378003_REJECTS" NO COMMIT
    18/10/24 02:00:10 INFO s2v.S2VUtils: Attempting to delete HDFS path: hdfs://hadoop-dw2-nn.smf1.com/user/ngoyal/tempx/S2V_job7762645538001378003
    18/10/24 02:00:10 INFO s2v.S2VUtils: Successfully deleted HDFS path: hdfs://hadoop-dw2-nn.smf1.com/user/ngoyal/tempx/S2V_job7762645538001378003
    18/10/24 02:00:10 ERROR s2v.S2V: Failed to save DataFrame to Vertica table: test.ngoyal-tst
    18/10/24 02:00:10 INFO s2v.S2V: Failure: Additional Job status info is available in Vertica table: test.S2V_JOB_STATUS_USER_NGOYAL
    18/10/24 02:00:10 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.Exception: S2V: FATAL ERROR for job S2V_job7762645538001378003. Job status information is available in the Vertica table test.S2V_JOB_STATUS_USER_NGOYAL. Unable to create/insert into target table test.ngoyal-tst with SaveMode: Append. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://hadoop-dw2-nn.smf1.com:50070/user/ngoyal/tempx/S2V_job7762645538001378003/*.parquet] because of error: Seen WebHDFS exception:
    URL: [http://hadoop-dw2-nn.smf1.com:50070/webhdfs/v1/user/ngoyal/tempx/S2V_job7762645538001378003/?user.name=ngoyal&op=LISTSTATUS]
    HTTP response code: 404
    Exception type: FileNotFoundException
    Exception message: File /user/ngoyal/tempx/S2V_job7762645538001378003/ does not exist.

    It seems like the job itself deletes the directory and then tries to access it, which is weird.

  • ngoyal Registered User

    Seems like WebHDFS isn't listing files in the directory. Thanks for the help.

  • Connyrt-emp1 Employee, Registered User
    edited October 24

    The COPY statement will fail if there are no files in HDFS. Have you tried writing your DF to HDFS directly from Spark? For example:

    mydf.write.parquet(webhdfs_path)

    Does it work?

    Also, I am curious: is there a particular reason you need to use Parquet as opposed to ORC? ORC is the default and recommended option.

    Can you capture the log again, this time with debug logging enabled (sc.setLogLevel("DEBUG")), and post it?

    The step that deletes the HDFS files is correct; it simply makes sure the folder is empty before attempting to write to HDFS.
