Vertica Spark Connector throws a java.lang.NullPointerException

I downloaded the Vertica Spark connector and tried the example shown in the connector guide. When I write a DataFrame to Vertica from the Spark shell using this statement:


df.write.format("com.vertica.spark.datasource.DefaultSource").options(opts).mode(saveMode).save()
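For reference, opts and saveMode are set up following the connector guide's example, roughly as below (table name, host, database, and credentials are placeholders for my environment):

import org.apache.spark.sql.SaveMode

// Connection options per the connector guide's example; values are placeholders.
val opts = Map(
  "table"         -> "my_table",      // target Vertica table
  "db"            -> "my_db",         // Vertica database name
  "user"          -> "dbadmin",
  "password"      -> "*****",
  "host"          -> "vertica-host",  // Vertica node to connect to
  "numPartitions" -> "4"              // number of partitions used for the COPY
)
val saveMode = SaveMode.Append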


I get the following exception. It looks like I have run into a bug in the Vertica connector code. Does anybody know a workaround? Thanks.


15/12/14 22:19:08 ERROR TaskSetManager: Task 2 in stage 2.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 93, 10.172.137.138): java.lang.Exception:
Partition[2]: ERROR: Failed while COPYing  data to Vertica.  partition=2. Error message:java.lang.NullPointerException
        at com.vertica.spark.s2v.S2V$$anonfun$1.apply(S2V.scala:199)
        at com.vertica.spark.s2v.S2V$$anonfun$1.apply(S2V.scala:113)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Comments

  • Hi Mohammed, 

    This can be fixed by setting the connector's "tmpdir" option, for example "tmpdir"->"/tmp", or whatever directory you prefer.  The user guide indicates that setting "tmpdir" is optional, but unfortunately we have found this is not the case for all users.  We will correct this.
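    A minimal sketch of the change, assuming the options map from the guide's example is named opts:

    val optsWithTmpdir = opts + ("tmpdir" -> "/tmp")
    df.write.format("com.vertica.spark.datasource.DefaultSource").options(optsWithTmpdir).mode(saveMode).save()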


    Since the connector's tmpdir relies on spark.local.dir, another way to fix this is to set spark.local.dir in your conf/spark-defaults.conf file or to set SPARK_LOCAL_DIRS in your conf/spark-env.sh file.  To check the current value of spark.local.dir, you can refer to the Environment tab of the Spark web UI or run the following from the Spark shell:

    sqlContext.getConf("spark.local.dir", "this is not set")
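    To set it explicitly, for example (the path is just a placeholder; use a directory that exists and is writable on every node):

    # conf/spark-defaults.conf
    spark.local.dir /tmp/spark-local

    # conf/spark-env.sh
    export SPARK_LOCAL_DIRS=/tmp/spark-local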


    Thank you,

    Jeff
