Vertica-Spark - AWS EMR - Failure due to timeout getting job status via WebHDFS

Hi Folks:

I am following the example at https://my.vertica.com/get-started-vertica/integrating-apache-spark/ on AWS EMR using spark-shell.

I am using:

  1. Spark 2.1.0
  2. vertica-8.1.0_spark2.0_scala2.11.jar
  3. vertica-jdbc-8.1.0-3.jar
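
For context, the write call I am running in spark-shell looks like the sketch below (host names, database, and credentials are placeholders, and the option names follow the S2V example in the linked guide):

import org.apache.spark.sql.SaveMode

val df = spark.range(5).toDF("value") // stand-in for the real DataFrame

// Placeholder connection options, per the S2V example in the linked guide.
val opts = Map(
  "table"        -> "S2V_test_table",
  "db"           -> "testdb",
  "user"         -> "dbadmin",
  "password"     -> "password",
  "host"         -> "vertica-host",
  // The connector stages intermediate ORC files in HDFS; Vertica then
  // reads them back over WebHDFS, which is where the error below occurs.
  "hdfs_url"     -> "hdfs://emr-master:8020/user/test/vertica",
  "web_hdfs_url" -> "webhdfs://emr-master:50070/user/test/vertica"
)

df.write.format("com.vertica.spark.datasource.DefaultSource")
  .options(opts)
  .mode(SaveMode.Append)
  .save()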

I am getting the error mentioned below.

I've verified that from the EMR master node I can use curl to reach the URL mentioned in the error.

Please let me know how to resolve this issue.

Thanks

Curl results from the EMR master (the FileNotFoundException presumably just means the job's temporary directory was already cleaned up after the failure; the point is that WebHDFS responds from this node):

curl 'http://xxx-xx-x-xxx.us-west-2.compute.internal:50070/webhdfs/v1/user/test/vertica/S2V_job5301965226302870212/?user.name=dbadmin&op=LISTSTATUS'
{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File /user/test/vertica/S2V_job5301965226302870212/ does not exist."}}

Exception in spark-shell when saving the DataFrame to Vertica:

17/06/09 21:23:19 ERROR S2V: Failed to save DataFrame to Vertica table: public.S2V_test_table
java.lang.Exception: S2V: FATAL ERROR for job S2V_job5301965226302870212. Job status information is available in the Vertica table public.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table public.S2V_test_table with SaveMode: Append. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: ****Failed to glob "webhdfs://xxx-xx-x-xxx.us-west-2.compute.internal:50070/user/test/vertica/S2V_job5301965226302870212/*.orc" because of error: [http://ip-xxx-xx-x-xxx.us-west-2.compute.internal:50070/webhdfs/v1/user/test/vertica/S2V_job5301965226302870212/?user.name=dbadmin&op=LISTSTATUS]: Curl Error: Couldn't connect to server
Error Details: Failed to connect to ip-xxx-xx-x-xxx.us-west-2.compute.internal port 50070: Connection timed out****

at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:342)
at com.vertica.spark.s2v.S2V.save(S2V.scala:392)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 54 elided
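
Since the timeout is reported by Vertica itself (it is the Vertica nodes that fetch the staged ORC files over WebHDFS), I assume the connectivity that matters is from the Vertica nodes to the EMR NameNode on port 50070, not from the EMR master. A minimal reachability check to run from a Vertica node (placeholder host, e.g. in a Scala REPL):

import java.net.{InetSocketAddress, Socket}

// Can we open a TCP connection to the NameNode's WebHDFS port in 5 seconds?
val nameNode = "ip-xxx-xx-x-xxx.us-west-2.compute.internal" // placeholder
val port     = 50070
val socket   = new Socket()
try {
  socket.connect(new InetSocketAddress(nameNode, port), 5000)
  println(s"Reachable: $nameNode:$port")
} catch {
  case e: java.io.IOException => println(s"Not reachable: ${e.getMessage}")
} finally {
  socket.close()
}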

Comments

  • Hi:

    The issue is resolved; it was connectivity between the Vertica and Hadoop clusters that was causing the failure.

    Thanks

  • Prakhar84 (Vertica Customer)

    Hi mans4singh,
    How was the issue solved? Any details would be highly appreciated.
    P
