Vertica-Spark - AWS EMR - Failure due to timeout getting job status via webhdfs
Hi Folks:
I am following the example at https://my.vertica.com/get-started-vertica/integrating-apache-spark/ on AWS EMR using spark-shell; the save call I am using is sketched below.
I am using:
- Spark 2.1.0
- vertica-8.1.0_spark2.0_scala2.11.jar
- vertica-jdbc-8.1.0-3.jar
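For reference, here is the shape of the save call, following the pattern from that example (a minimal sketch: host, credentials, and the staging path are placeholders, and df is the DataFrame being saved):

import org.apache.spark.sql.SaveMode

// Placeholder connection values -- replace with real ones.
val opts = Map(
  "host" -> "<vertica-host>",
  "db" -> "<database>",
  "user" -> "dbadmin",
  "password" -> "<password>",
  "table" -> "S2V_test_table",
  // Staging directory: the connector writes .orc files here, and Vertica
  // then reads them back over WebHDFS (the glob shown in the error below).
  "hdfs_url" -> "webhdfs://<namenode-host>:50070/user/test/vertica"
)

df.write
  .format("com.vertica.spark.datasource.DefaultSource")
  .options(opts)
  .mode(SaveMode.Append)
  .save()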
I am getting the error mentioned below.
I've validated that I can use curl from the EMR master node to access the URL mentioned in the error.
Please let me know how to resolve this issue.
Thanks
Curl results:
curl 'http://xxx-xx-x-xxx.us-west-2.compute.internal:50070/webhdfs/v1/user/test/vertica/S2V_job5301965226302870212/?user.name=dbadmin&op=LISTSTATUS'
{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File /user/test/vertica/S2V_job5301965226302870212/ does not exist."}}
Exception in spark-shell when saving the DataFrame to Vertica:
17/06/09 21:23:19 ERROR S2V: Failed to save DataFrame to Vertica table: public.S2V_test_table
java.lang.Exception: S2V: FATAL ERROR for job S2V_job5301965226302870212. Job status information is available in the Vertica table public.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table public.S2V_test_table with SaveMode: Append. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: ****Failed to glob "webhdfs://xxx-xx-x-xxx.us-west-2.compute.internal:50070/user/test/vertica/S2V_job5301965226302870212/*.orc" because of error: [http://ip-xxx-xx-x-xxx.us-west-2.compute.internal:50070/webhdfs/v1/user/test/vertica/S2V_job5301965226302870212/?user.name=dbadmin&op=LISTSTATUS]: Curl Error: Couldn't connect to server
Error Details: Failed to connect to ip-xxx-xx-x-xxx.us-west-2.compute.internal port 50070: Connection timed out****
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:342)
at com.vertica.spark.s2v.S2V.save(S2V.scala:392)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 54 elided
Comments
Hi:
The issue is resolved; it was connectivity between the Vertica cluster and the Hadoop cluster that was causing the failure.
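For anyone else hitting this: the curl test in my original post only proves that the EMR master can reach the NameNode. The LISTSTATUS request in the error is issued by the Vertica nodes themselves (hence the "Curl Error: Couldn't connect to server"), so the same check has to succeed from every Vertica node, e.g. run curl 'http://<namenode-host>:50070/webhdfs/v1/user/test/vertica/?user.name=dbadmin&op=LISTSTATUS' on each node. In our case the Vertica nodes could not reach port 50070 on the Hadoop cluster; once that connectivity was opened, the save succeeded.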
Thanks
Hi mans4singh
How was the issue solved? Any details would be highly appreciated.
P