Class Not Found - com.vertica.spark.datasource.DefaultSource

Hi,

When I try to ingest data from Vertica using Spark (Java), I end up with the error below.
stack-trace: java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html

I understand that the jar is not on my classpath. However, I cannot find the actual jar file that I need to use/download.

Please let me know where I can download the actual jar file.

Answers

  • Jim_Knicely - Administrator

    Prior to Vertica 9.1, the Spark Connector was distributed on the myVertica portal; if you are running a version earlier than 9.1, you must download the correct connector file from there.

    For Vertica 9.1:

    • The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib.
    • The JDBC client library is /opt/vertica/java/vertica-jdbc.jar.
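
    For example, with the default 9.1 paths above, you would pass both files when launching Spark. A minimal sketch (the exact connector jar name depends on your Spark and Scala versions, so check the lib directory first; the script name is just a placeholder):

    ./bin/spark-submit --jars /opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar,/opt/vertica/java/vertica-jdbc.jar my_vertica_job.py
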
  • Hi Jim,

    I have installed the Vertica 9.1 client on EMR, but I am not able to connect to Vertica from pyspark.

    I am getting the error below:

    java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.ClassNotFoundException: com.vertica.spark.datasource.DefaultSource.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
    ... 13 more

  • LenoyJ - Employee

    Ensure that the JDBC and Vertica Spark Connector jar files are added to the classpath, or specify them as a parameter when running pyspark.

    Something like:
    ./bin/pyspark --jars /opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar,/opt/vertica/java/vertica-jdbc.jar

  • edited May 2019

    After setting the jars, I am now getting a different error.

    java.lang.Exception: Specified relation name "SBG_PUBLISHED.ab_testing_data" does not exist.
    at com.vertica.spark.seg.SegmentsMetaInfo$class.getSyntheticSegExpr(SegmentsMetaInfo.scala:251)
    at com.vertica.spark.datasource.DefaultSource.getSyntheticSegExpr(VerticaSource.scala:15)
    at com.vertica.spark.seg.SegmentsMetaInfo$class.initSegInfo(SegmentsMetaInfo.scala:50)
    at com.vertica.spark.datasource.DefaultSource.initSegInfo(VerticaSource.scala:15)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:44)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)

    Note: I tried without the schema name as well; still the same error.

    Below is my code:

    pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar

    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    sqlContext = SQLContext(sc)

    opts={}
    opts['table'] = 'SBG_PUBLISHED.ab_testing_data'
    opts['db']='IDEA'
    opts['user']='svenkatesh'
    opts['password']='Password7'
    opts['host']='10.81.189.230'

    df_new=sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)

    # view the new DataFrame

    df_new.take(10)

  • LenoyJ - Employee

    You need to specify your schema in a separate parameter called 'dbschema'; otherwise it defaults to public.

    opts['dbschema'] = 'SBG_PUBLISHED'
    opts['table'] = 'ab_testing_data'
    

    Please read the following 9.1.x documentation for the list of available parameters: https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/SparkConnector/WritingtoVerticaUsingDefaultSource.htm#Parameters
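
    Putting it together, the read from your earlier post would look roughly like this (a sketch that reuses your connection options; only the schema/table handling changes):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    opts = {}
    opts['dbschema'] = 'SBG_PUBLISHED'  # schema goes in its own parameter
    opts['table'] = 'ab_testing_data'   # table name only, no schema prefix
    opts['db'] = 'IDEA'
    opts['user'] = 'svenkatesh'
    opts['password'] = 'Password7'
    opts['host'] = '10.81.189.230'

    # load through the Vertica data source and preview a few rows
    df_new = sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)
    df_new.take(10)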

  • Hi Lenoy,

    Now, when I try to write data to Vertica from Hive, I am getting the error below. Can you please help me with this?

    ERROR S2VUtils: ERROR: Could not connect to Vertica: java.sql.SQLNonTransientConnectionException: [Vertica][VJDBC](100176) Failed to connect to host 10.81.189.230 on port 5433. Reason: Failed to establish a connection to the primary server or any backup address.
    ERROR S2VUtils: getNodeIPsFromVertica: FAILED to retrieve node IPs from Vertica host:10.81.189.230
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
        self._jwrite.save()
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
      File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o357.save.
    java.lang.Exception: ERROR: S2V.save(): Unable to retrieve Vertica node IPs from Vertica host:10.81.189.230
    at com.vertica.spark.s2v.S2VUtils.getVerticaNodeIPs(S2VUtils.scala:630)
    at com.vertica.spark.s2v.S2V.save(S2V.scala:377)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)

    Code:

    pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar

    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    sqlContext = SQLContext(sc)

    opts={}
    opts['dbschema'] = 'SBG_SANDBOX'
    opts['table'] = 'test_sbg'
    opts['db']='idea'
    opts['user']='svenkatesh'
    opts['password']='Password7'
    opts['host']='10.81.189.230'
    opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'

    df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")

    df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="append", **opts)

  • Hi Lenoy,

    I am now getting a different error when writing to Vertica from Hive. Can you please help me with this?

    File "", line 1, in
    File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
      self._jwrite.save()
    File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
      return f(*a, **kw)
    File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o145.save.
    java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Either set web_hdfs_url or check VERIFY_HADOOP_CONF_DIR() passes with Vertica
    at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
    at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    sqlContext = SQLContext(sc)

    opts={}
    opts['dbschema'] = 'SBG_SANDBOX'
    opts['table'] = 'test_sbg'
    opts['db']='idea'
    opts['user']='svenkatesh'
    opts['password']='Password7'
    opts['host']='10.81.189.230'
    opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'

    df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")

    df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="Overwrite", **opts)

  • Hi Lenoy,

    I tried providing the webhdfs URL as well, but I am getting the error below. Please help me with this.

    9/05/31 04:51:49 ERROR S2V: Failed to save DataFrame to Vertica table: SBG_SANDBOX.test_sbg
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
        self._jwrite.save()
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
      File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
    java.lang.Exception: S2V: FATAL ERROR for job S2V_job1940324723276020621. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/*.orc] because of error: [http://ip-10-68-64-76.us-west-2.compute.internal:8020/webhdfs/v1/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/?op=LISTSTATUS&user.name=svenkatesh]: Curl Error: Couldn't connect to server
    Error Details: Failed to connect to ip-10-68-64-76.us-west-2.compute.internal port 8020: Connection timed out
    at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
    at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
    at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  • LenoyJ - Employee
    edited May 2019

    The Vertica Spark connector stages the DataFrame in HDFS and then loads it into Vertica, so you must first configure your Vertica cluster to access HDFS. To do that:

    • Locate your HDFS config files: core-site.xml and hdfs-site.xml.
    • Copy them to every Vertica node at the same path, say /home/dbadmin.
    • Run vsql and point the database at that directory:
      ALTER DATABASE dbname SET HadoopConfDir = '/home/dbadmin';

    • If you are on a version newer than 8.1, verify that the configuration is okay with:
      SELECT VERIFY_HADOOP_CONF_DIR();

    • If you are on a version newer than 9.2, a better verification function is available:
      SELECT HDFS_CLUSTER_CONFIG_CHECK();

    Then try again.
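
    For reference, the copy step can be scripted; a minimal sketch, assuming the Hadoop configs live in /etc/hadoop/conf, and where the node names and the dbadmin user are placeholders for your actual Vertica hosts and admin account:

    # copy the Hadoop client configs to the same path on every Vertica node
    for node in vertica-node1 vertica-node2 vertica-node3; do
      scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml dbadmin@"$node":/home/dbadmin/
    done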
