Class Not Found - com.vertica.spark.datasource.DefaultSource
Hi,
When I try to ingest data from the Vertica DB using Spark (Java), I end up with the error below.
stack-trace="java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
I understand that the jar is not on my classpath. However, I cannot find the actual jar file that I need to use/download.
Please let me know where I can download the jar file.
Answers
Prior to Vertica 9.1, the Spark Connector was distributed on the myVertica portal; if you are on a version earlier than 9.1, download the correct connector file from there.
For Vertica 9.1 and later, the connector ships with the server installation under /opt/vertica/packages/SparkConnector/lib/ (the JDBC driver is under /opt/vertica/java/), so there is nothing separate to download.
Hi Jim,
I have installed the Vertica 9.1 client on EMR, but I am not able to connect to Vertica from PySpark:
java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.vertica.spark.datasource.DefaultSource.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
Ensure that the JDBC and Vertica-Spark Connector jar files are added to the classpath, or specify them as a parameter when running pyspark.
Something like:
./bin/pyspark --jars /opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar,/opt/vertica/java/vertica-jdbc.jar
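If you prefer, the same jars can also be supplied programmatically when you build the session yourself, for example in a script run with spark-submit. This is an untested sketch; the paths assume a default Vertica 9.1 server install and must exist on the machine running the driver:
from pyspark.sql import SparkSession

# Comma-separated list of the connector and JDBC jars (paths assumed from a default install).
jars = ",".join([
    "/opt/vertica/packages/SparkConnector/lib/vertica-spark2.1_scala2.11.jar",
    "/opt/vertica/java/vertica-jdbc.jar",
])

# Note: spark.jars must be set before the JVM starts, so this only works when the
# session is created here; in an already-running pyspark shell, use --jars instead.
spark = SparkSession.builder \
    .appName("vertica-read") \
    .config("spark.jars", jars) \
    .getOrCreate()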
After setting the jars, I am now getting a different error.
java.lang.Exception: Specified relation name "SBG_PUBLISHED.ab_testing_data" does not exist.
at com.vertica.spark.seg.SegmentsMetaInfo$class.getSyntheticSegExpr(SegmentsMetaInfo.scala:251)
at com.vertica.spark.datasource.DefaultSource.getSyntheticSegExpr(VerticaSource.scala:15)
at com.vertica.spark.seg.SegmentsMetaInfo$class.initSegInfo(SegmentsMetaInfo.scala:50)
at com.vertica.spark.datasource.DefaultSource.initSegInfo(VerticaSource.scala:15)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
Note: I tried without the schema name as well; still the same error.
Below is my code:
pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar
from pyspark.sql import SQLContext
from pyspark.sql import DataFrame
sqlContext = SQLContext(sc)
opts={}
opts['table'] = 'SBG_PUBLISHED.ab_testing_data'
opts['db']='IDEA'
opts['user']='svenkatesh'
opts['password']='Password7'
opts['host']='10.81.189.230'
df_new=sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)
# view the new DataFrame
df_new.take(10)
You need to specify your schema in a separate parameter called 'dbschema'; otherwise it defaults to public.
Please read the following 9.1.x documentation for the list of available parameters: https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/SparkConnector/WritingtoVerticaUsingDefaultSource.htm#Parameters
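For example, with the options from your snippet, the read would look something like this (untested sketch; same host and credentials you posted):
opts = {}
opts['dbschema'] = 'SBG_PUBLISHED'   # schema goes in its own parameter
opts['table'] = 'ab_testing_data'    # table name only, without the schema prefix
opts['db'] = 'IDEA'
opts['user'] = 'svenkatesh'
opts['password'] = 'Password7'
opts['host'] = '10.81.189.230'
df_new = sqlContext.read.load(format="com.vertica.spark.datasource.DefaultSource", **opts)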
Hi lenoy,
Now, when I try to write data to Vertica from Hive, I am getting the error below. Can you please help me with this?
ERROR S2VUtils: ERROR: Could not connect to Vertica: java.sql.SQLNonTransientConnectionException: [Vertica][VJDBC](100176) Failed to connect to host 10.81.189.230 on port 5433. Reason: Failed to establish a connection to the primary server or any backup address.
ERROR S2VUtils: getNodeIPsFromVertica: FAILED to retrieve node IPs from Vertica host:10.81.189.230
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
    self._jwrite.save()
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o357.save.
: java.lang.Exception: ERROR: S2V.save(): Unable to retrieve Vertica node IPs from Vertica host:10.81.189.230
at com.vertica.spark.s2v.S2VUtils.getVerticaNodeIPs(S2VUtils.scala:630)
at com.vertica.spark.s2v.S2V.save(S2V.scala:377)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
Code:
pyspark --jars /opt/vertica/vertica-spark2.1_scala2.11.jar,/opt/vertica/vertica-jdbc-9.1.1_20190520.jar
from pyspark.sql import SQLContext
from pyspark.sql import DataFrame
sqlContext = SQLContext(sc)
opts={}
opts['dbschema'] = 'SBG_SANDBOX'
opts['table'] = 'test_sbg'
opts['db']='idea'
opts['user']='svenkatesh'
opts['password']='Password7'
opts['host']='10.81.189.230'
opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'
df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")
df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="append", **opts)
Hi lenoy,
When I try to write the data to Vertica from Hive, I am now getting the error below. Can you please help me with this?
File "", line 1, in File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save self._jwrite.save() File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o145.save.- java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.lang.Exception: S2V: FATAL ERROR for job S2V_job4908017182903424450. Either set web_hdfs_url or check VERIFY_HADOOP_CONF_DIR() passes with Vertica
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
from pyspark.sql import SQLContext
from pyspark.sql import DataFrame
sqlContext = SQLContext(sc)
opts={}
opts['dbschema'] = 'SBG_SANDBOX'
opts['table'] = 'test_sbg'
opts['db']='idea'
opts['user']='svenkatesh'
opts['password']='Password7'
opts['host']='10.81.189.230'
opts['hdfs_url']='hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'
df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1 ")
df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="Overwrite", **opts)
Hi lenoy,
I tried providing the webhdfs URL as well, but I am getting the error below. Please help me with this.
19/05/31 04:51:49 ERROR S2V: Failed to save DataFrame to Vertica table: SBG_SANDBOX.test_sbg
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 703, in save
    self._jwrite.save()
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
: java.lang.Exception: S2V: FATAL ERROR for job S2V_job1940324723276020621. Job status information is available in the Vertica table SBG_SANDBOX.S2V_JOB_STATUS_USER_SVENKATESH. Unable to create/insert into target table SBG_SANDBOX.test_sbg with SaveMode: Overwrite. ERROR MESSAGE: ERROR: java.sql.SQLException: [Vertica]VJDBC ERROR: Failed to glob [webhdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/*.orc] because of error: [http://ip-10-68-64-76.us-west-2.compute.internal:8020/webhdfs/v1/user/hvc-prd-default-ged/vertica/S2V_job1940324723276020621/?op=LISTSTATUS&user.name=svenkatesh]: Curl Error: Couldn't connect to server
Error Details: Failed to connect to ip-10-68-64-76.us-west-2.compute.internal port 8020: Connection timed out
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:339)
at com.vertica.spark.s2v.S2V.save(S2V.scala:389)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
The Vertica-Spark connector stages the DataFrame in HDFS and then loads it into Vertica, so you must first configure your Vertica cluster to access HDFS. To do that:
Run vsql and set HadoopConfDir to the directory containing your Hadoop configuration files (e.g. core-site.xml and hdfs-site.xml):
ALTER DATABASE dbname SET HadoopConfDir = '/home/dbadmin';
If you are on >8.1, verify that the configuration is okay with:
SELECT VERIFY_HADOOP_CONF_DIR();
If you are on >9.2, a better function is available for verification:
SELECT HDFS_CLUSTER_CONFIG_CHECK();
Then try again.
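Once the configuration check passes, the write from your earlier snippet should go through. As a rough sketch (same options you posted, reusing your existing sqlContext; whether you still need the hdfs_url staging location depends on your setup):
opts = {}
opts['dbschema'] = 'SBG_SANDBOX'
opts['table'] = 'test_sbg'
opts['db'] = 'idea'
opts['user'] = 'svenkatesh'
opts['password'] = 'Password7'
opts['host'] = '10.81.189.230'
# HDFS staging area the connector writes intermediate ORC files to before loading:
opts['hdfs_url'] = 'hdfs://ip-10-68-64-76.us-west-2.compute.internal:8020/user/hvc-prd-default-ged/'
df_new = sqlContext.sql("select product from sbg_hvc_ods.pc_all_products_ingestion_to_profile limit 1")
df_new.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="overwrite", **opts)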