Class Not Found - com.vertica.spark.datasource.DefaultSource
Hi,
When I try to ingest data from the Vertica DB using Spark (Java), I end up with the error below.
java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
I understand that the jar is not on my classpath; however, I cannot find the actual jar file that I need to use/download.
Please let me know where I can download the actual jar file.
Answers
Prior to Vertica version 9.1, the Spark Connector was distributed on the myVertica portal. If you have a version of Vertica prior to 9.1, you must download the correct connector file from that portal.
For Vertica 9.1 and later, the connector ships with the server installation and can be found on any node under /opt/vertica/packages/SparkConnector/lib.
Hi Jim,
I have installed the Vertica 9.1 client on EMR, but I am not able to connect to Vertica from PySpark.
java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.vertica.spark.datasource.DefaultSource.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
Ensure that the Vertica JDBC and Vertica Spark Connector jar files are on the classpath, or pass them as a parameter when launching pyspark.
Something like:
./bin/pyspark --jars /opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar,/opt/vertica/java/vertica-jdbc.jar
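If you prefer to configure this in code rather than on the command line, the same jars can be supplied through the standard spark.jars option when building the session. A minimal sketch, assuming the jar paths shown above:
# spark.jars takes a comma-separated list of local jar paths to put on the driver and executor classpaths
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("vertica-connector-test")
         .config("spark.jars",
                 "/opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar,"
                 "/opt/vertica/java/vertica-jdbc.jar")
         .getOrCreate())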
After adding the jars, I am now getting a different error.
java.lang.Exception: Specified relation name "SBG_PUBLISHED.ab_testing_data" does not exist.
at com.vertica.spark.seg.SegmentsMetaInfo$class.getSyntheticSegExpr(SegmentsMetaInfo.scala:251)
at com.vertica.spark.datasource.DefaultSource.getSyntheticSegExpr(VerticaSource.scala:15)
at com.vertica.spark.seg.SegmentsMetaInfo$class.initSegInfo(SegmentsMetaInfo.scala:50)
at com.vertica.spark.datasource.DefaultSource.initSegInfo(VerticaSource.scala:15)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
Note: I tried without the schema name as well; still the same error.
Below is my code:
pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar
from pyspark.sql import SQLContext
from pyspark.sql import DataFrame
sqlContext = SQLContext(sc)
opts={}
opts['table'] = 'SBG_PUBLISHED.ab_testing_data'
opts['db']='IDEA'
opts['user']='svenkatesh'
opts['password']='Password7'
opts['host']='10.81.189.230'
df_new=sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)
# view the new DataFrame
df_new.take(10)
You need to specify your schema in a separate parameter called 'dbschema'; otherwise it defaults to public.
Please see the 9.1.x documentation for the list of available parameters: https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/SparkConnector/WritingtoVerticaUsingDefaultSource.htm#Parameters
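For example, a minimal sketch of the corrected read, moving the schema out of 'table' and into 'dbschema' (the host, database, and credentials are the ones from the code above):
opts = {}
opts['host'] = '10.81.189.230'
opts['db'] = 'IDEA'
opts['user'] = 'svenkatesh'
opts['password'] = 'Password7'
opts['dbschema'] = 'SBG_PUBLISHED'  # schema goes here instead of being prefixed to the table name
opts['table'] = 'ab_testing_data'
df_new = sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)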
Hi Lenoy,
Now when I try to write data to Vertica from Hive, I am getting the error below. Can you please help me with this?
ERROR S2VUtils: ERROR: Could not connect to Vertica: java.sql.SQLNonTransientConnectionException: [Vertica][VJDBC](100176) Failed to connect to host 10.81.189.230 on port 5433. Reason: Failed to establish a connection to the primary server or any backup address.
ERROR S2VUtils: getNodeIPsFromVertica: FAILED to retrieve node IPs from Vertica host:10.81.189.230
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
    self._jwrite.save()
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o357.save.
java.lang.Exception: ERROR: S2V.save(): Unable to retrieve Vertica node IPs from Vertica host:10.81.189.230
at com.vertica.spark.s2v.S2VUtils.getVerticaNodeIPs(S2VUtils.scala:630)
at com.vertica.spark.s2v.S2V.save(S2V.scala:377)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
Code:
pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar
from pyspark.sql import SQLContext
from pyspark.sql import DataFrame
sqlContext = SQLContext(sc)
opts={}
opts['dbschema'] = 'SBG_SANDBOX'
opts['table'] = 'test_sbg'
opts['db']='idea'
opts['user']='svenkatesh'
opts['password']='Password7'
opts['host']='10.81.189.230'
opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'
df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")
df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="append", **opts)
Hi Lenoy,
Now when I try to write data to Vertica from Hive, I am getting the error below. Can you please help me with this?
File "", line 1, in File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save self._jwrite.save() File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o145.save.- java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Either set web_hdfs_url or check VERIFY_HADOOP_CONF_DIR() passes with Vertica
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
from pyspark.sql import SQLContext
from pyspark.sql import DataFrame
sqlContext = SQLContext(sc)
opts={}
opts['dbschema'] = 'SBG_SANDBOX'
opts['table'] = 'test_sbg'
opts['db']='idea'
opts['user']='svenkatesh'
opts['password']='Password7'
opts['host']='10.81.189.230'
opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'
df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")
df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="Overwrite", **opts)
Hi Lenoy,
I tried providing the WebHDFS URL as well, but I am getting the error below. Please help me with this.
9/05/31 04:51:49 ERROR S2V: Failed to save DataFrame to Vertica table: SBG_SANDBOX.test_sbg
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
    self._jwrite.save()
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
java.lang.Exception: S2V: FATAL ERROR for job S2V_job1940324723276020621. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/*.orc] because of error: [http://ip-10-68-64-76.us-west-2.compute.internal:8020/webhdfs/v1/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/?op=LISTSTATUS&user.name=svenkatesh]: Curl Error: Couldn't connect to server
Error Details: Failed to connect to ip-10-68-64-76.us-west-2.compute.internal port 8020: Connection timed out
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
The Vertica Spark connector stages the DataFrame on HDFS and then loads it into Vertica, so your Vertica cluster must be able to access HDFS first. To do that:
Run vsql and point HadoopConfDir at the directory containing your Hadoop configuration files (core-site.xml and hdfs-site.xml):
ALTER DATABASE dbname SET HadoopConfDir = '/home/dbadmin';
If you are on 8.1 or later, verify the configuration with:
SELECT VERIFY_HADOOP_CONF_DIR();
If you are on 9.2 or later, a better verification function is available:
SELECT HDFS_CLUSTER_CONFIG_CHECK();
Then try again.
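Note also that the curl error above is hitting WebHDFS on port 8020, which is the NameNode RPC port rather than a WebHDFS port. If setting HadoopConfDir is not an option, here is a rough sketch of passing the connector's 'web_hdfs_url' parameter (the one named in the error message) instead; port 50070 is only an assumed default, so check which port WebHDFS actually listens on in the EMR cluster:
# hdfs_url is where Spark stages the intermediate ORC files; web_hdfs_url is the
# WebHDFS endpoint Vertica uses to read them back (the port here is an assumption)
opts['hdfs_url'] = 'hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'
opts['web_hdfs_url'] = 'webhdfs://ip-10-68-64-76.us-west-2.compute.internal:50070/user/hvc-prd-default-ged/'
df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="overwrite", **opts)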
Hi Lenoy/Jim_Knicely,
Can I use vertica-spark2.1_scala2.11.jar with Vertica DB version 23.x?
It might work in some cases, but vertica-spark2.1_scala2.11.jar is not compatible with Vertica DB version 23.x.
This jar was built for older versions of Vertica and Spark and targets Scala 2.11, which modern Spark versions no longer support.