Class Not Found - com.vertica.spark.datasource.DefaultSource

Hi,

When I try to ingest data from Vertica using Spark (Java), I end up with the error below.
stack-trace: java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html

I understand that the jar is not on my classpath. However, I cannot find the actual jar file that I need to use/download.

Please let me know where I can download the actual jar file.

Answers

  • Jim_Knicely - Administrator

    Prior to Vertica 9.1, the Spark Connector was distributed on the myVertica portal; if you are running a version earlier than 9.1, you must download the correct connector file from there.

    For Vertica 9.1:

    • The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib.
    • The JDBC client library is /opt/vertica/java/vertica-jdbc.jar.
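
    For example, with the default 9.1 paths above, you would pass both files when launching Spark. A minimal sketch (the exact connector jar name depends on your Spark and Scala versions, so check the lib directory first; the script name is just a placeholder):

    ./bin/spark-submit --jars /opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar,/opt/vertica/java/vertica-jdbc.jar my_vertica_job.py
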
  • Hi Jim,

    I have installed the Vertica 9.1 client on EMR, but I am not able to connect to Vertica from pyspark.

    I am getting the error below:

    java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.ClassNotFoundException: com.vertica.spark.datasource.DefaultSource.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
    ... 13 more

  • LenoyJ - Employee

    Ensure that the JDBC and Vertica Spark Connector jar files are added to the classpath, or specify them as a parameter when running pyspark.

    Something like:
    ./bin/pyspark --jars /opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar,/opt/vertica/java/vertica-jdbc.jar

  • edited May 2019

    After setting the jars, I am now getting a different error.

    java.lang.Exception: Specified relation name "SBG_PUBLISHED.ab_testing_data" does not exist.
    at com.vertica.spark.seg.SegmentsMetaInfo$class.getSyntheticSegExpr(SegmentsMetaInfo.scala:251)
    at com.vertica.spark.datasource.DefaultSource.getSyntheticSegExpr(VerticaSource.scala:15)
    at com.vertica.spark.seg.SegmentsMetaInfo$class.initSegInfo(SegmentsMetaInfo.scala:50)
    at com.vertica.spark.datasource.DefaultSource.initSegInfo(VerticaSource.scala:15)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:44)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)

    Note: I tried without the schema name as well; still the same error.

    Below is my code:

    pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar

    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    sqlContext = SQLContext(sc)

    opts={}
    opts['table'] = 'SBG_PUBLISHED.ab_testing_data'
    opts['db']='IDEA'
    opts['user']='svenkatesh'
    opts['password']='Password7'
    opts['host']='10.81.189.230'

    df_new=sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)

    # view the new DataFrame

    df_new.take(10)

  • LenoyJ - Employee

    You need to specify your schema in a separate parameter called 'dbschema'; otherwise it defaults to public.

    opts['dbschema'] = 'SBG_PUBLISHED'
    opts['table'] = 'ab_testing_data'
    

    Please read the following 9.1.x documentation for the list of available parameters: https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/SparkConnector/WritingtoVerticaUsingDefaultSource.htm#Parameters
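
    Putting it together, the read from your earlier post would look roughly like this (a sketch that reuses your connection options; only the schema/table handling changes):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    opts = {}
    opts['dbschema'] = 'SBG_PUBLISHED'  # schema goes in its own parameter
    opts['table'] = 'ab_testing_data'   # table name only, no schema prefix
    opts['db'] = 'IDEA'
    opts['user'] = 'svenkatesh'
    opts['password'] = 'Password7'
    opts['host'] = '10.81.189.230'

    # load through the Vertica data source and preview a few rows
    df_new = sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)
    df_new.take(10)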

  • Hi Lenoy,

    Now, when I try to write data to Vertica from Hive, I am getting the error below. Can you please help me with this?

    ERROR S2VUtils: ERROR: Could not connect to Vertica: java.sql.SQLNonTransientConnectionException: [Vertica][VJDBC](100176) Failed to connect to host 10.81.189.230 on port 5433. Reason: Failed to establish a connection to the primary server or any backup address.
    ERROR S2VUtils: getNodeIPsFromVertica: FAILED to retrieve node IPs from Vertica host:10.81.189.230
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
        self._jwrite.save()
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
      File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o357.save.
    java.lang.Exception: ERROR: S2V.save(): Unable to retrieve Vertica node IPs from Vertica host:10.81.189.230
    at com.vertica.spark.s2v.S2VUtils.getVerticaNodeIPs(S2VUtils.scala:630)
    at com.vertica.spark.s2v.S2V.save(S2V.scala:377)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)

    Code:

    pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar

    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    sqlContext = SQLContext(sc)

    opts={}
    opts['dbschema'] = 'SBG_SANDBOX'
    opts['table'] = 'test_sbg'
    opts['db']='idea'
    opts['user']='svenkatesh'
    opts['password']='Password7'
    opts['host']='10.81.189.230'
    opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'

    df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")

    df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="append", **opts)

  • Hi Lenoy,

    I am now getting a different error when writing to Vertica from Hive. Can you please help me with this?

    File "", line 1, in
    File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
      self._jwrite.save()
    File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
      return f(*a, **kw)
    File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o145.save.
    java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Either set web_hdfs_url or check VERIFY_HADOOP_CONF_DIR() passes with Vertica
    at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
    at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    sqlContext = SQLContext(sc)

    opts={}
    opts['dbschema'] = 'SBG_SANDBOX'
    opts['table'] = 'test_sbg'
    opts['db']='idea'
    opts['user']='svenkatesh'
    opts['password']='Password7'
    opts['host']='10.81.189.230'
    opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'

    df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")

    df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="Overwrite", **opts)

  • Hi Lenoy,

    I tried providing the webhdfs URL as well, but I am getting the error below. Please help me with this.

    9/05/31 04:51:49 ERROR S2V: Failed to save DataFrame to Vertica table: SBG_SANDBOX.test_sbg
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
        self._jwrite.save()
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
      File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
    java.lang.Exception: S2V: FATAL ERROR for job S2V_job1940324723276020621. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/*.orc] because of error: [http://ip-10-68-64-76.us-west-2.compute.internal:8020/webhdfs/v1/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/?op=LISTSTATUS&user.name=svenkatesh]: Curl Error: Couldn't connect to server
    Error Details: Failed to connect to ip-10-68-64-76.us-west-2.compute.internal port 8020: Connection timed out
    at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
    at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  • LenoyJ - Employee
    edited May 2019

    The Vertica Spark connector stages the DataFrame in HDFS and then loads it into Vertica, so you must first configure your Vertica cluster to access HDFS. To do that:

    • Locate your HDFS config files: core-site.xml and hdfs-site.xml.
    • Copy them to every Vertica node at the same path, say /home/dbadmin.
    • Run vsql and point the database at that directory:
      ALTER DATABASE dbname SET HadoopConfDir = '/home/dbadmin';

    • If you are on a version newer than 8.1, verify that the configuration is okay with:
      SELECT VERIFY_HADOOP_CONF_DIR();

    • If you are on a version newer than 9.2, a better verification function is available:
      SELECT HDFS_CLUSTER_CONFIG_CHECK();

    Then try again.
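
    For reference, the copy step can be scripted; a minimal sketch, assuming the Hadoop configs live in /etc/hadoop/conf, and where the node names and the dbadmin user are placeholders for your actual Vertica hosts and admin account:

    # copy the Hadoop client configs to the same path on every Vertica node
    for node in vertica-node1 vertica-node2 vertica-node3; do
      scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml dbadmin@"$node":/home/dbadmin/
    done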
