Vertica Spark Connector throws a java.lang.NullPointerException
I downloaded the Vertica Spark connector and tried the example shown in the connector guide. When I write a DataFrame to Vertica from the Spark-shell using this statement:
df.write.format("com.vertica.spark.datasource.DefaultSource").options(opts).mode(saveMode).save()
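For reference, opts and saveMode are defined roughly as follows (the connection values are placeholders for my setup, and the option names follow the guide's example):

import org.apache.spark.sql.SaveMode

// Connector options: target table plus Vertica connection details
val opts = Map(
  "table" -> "test_table",
  "db" -> "testdb",
  "user" -> "dbadmin",
  "password" -> "*****",
  "host" -> "vertica-host"
)
val saveMode = SaveMode.Append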
I get the following exception. It looks like I have run into a bug in the Vertica connector code. Does anybody know a workaround? Thanks.
15/12/14 22:19:08 ERROR TaskSetManager: Task 2 in stage 2.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 93, 10.172.137.138): java.lang.Exception:
Partition[2]: ERROR: Failed while COPYing data to Vertica. partition=2. Error message:java.lang.NullPointerException
at com.vertica.spark.s2v.S2V$$anonfun$1.apply(S2V.scala:199)
at com.vertica.spark.s2v.S2V$$anonfun$1.apply(S2V.scala:113)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Comments
Hi Mohammed,
This can be fixed by setting the connector's "tmpdir" option, e.g. "tmpdir" -> "/tmp", or whatever directory you prefer. The user guide indicates that setting "tmpdir" is optional, but unfortunately we have found that this is not the case for all users. We will correct the documentation.
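For example, assuming the opts map from your question, the write would become (the directory is just an example; use any location writable on the executor nodes):

// Add tmpdir to the existing options map and write as before
val optsWithTmpdir = opts + ("tmpdir" -> "/tmp")
df.write.format("com.vertica.spark.datasource.DefaultSource").options(optsWithTmpdir).mode(saveMode).save()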
Since our tmpdir default relies on spark.local.dir, another way to fix this is to set spark.local.dir in your conf/spark-defaults.conf file, or to set SPARK_LOCAL_DIRS in your spark-env.sh file. To check the current value of spark.local.dir, look at the Environment tab of the Spark web UI, or query it from a running spark-shell.
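A minimal check from the spark-shell, assuming sc is the usual SparkContext (getOption returns None when the property is unset, in which case Spark falls back to its default, typically /tmp):

// Inspect the effective spark.local.dir setting
sc.getConf.getOption("spark.local.dir")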
Thank you,
Jeff