Hi Lenoy,
Thanks for the detailed explanation, but I will need your help to implement this in PySpark code. I searched on Google but could not find much more.
Based on the link below, I tried:
https://stackoverflow.com/questions/51731998/how-to-add-custom-jdbc-dialects-in-pyspark
Are you saying I should:
a) save the VerticaDialect.scala shared earlier to a location and then call pyspark like below?
pyspark2 --jars /home/x/vertica-9.0.1_spark2.1_scala2.11.jar,/home/x/vertica-jdbc-9.2.0-0.jar,/home/x/VerticaDialect.scala
I tried the code below but am getting an error:
from py4j.java_gateway import java_import
gw = spark.sparkContext._gateway
java_import(gw.jvm, "com.me.VerticaDialect")
gw.jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(gw.jvm.com.me.VerticaDialect())
but I get an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'JavaPackage' object is not callable
Please help us implement this in PySpark as well, so that it will be helpful for clients using PySpark instead of Scala.
Hi, I believe you would need to compile the Scala file into a JAR and add it to the classpath. We will run some tests to determine the best approach, but it may be fairly complex to set up.
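For reference, here is a rough sketch of what that could look like once compiled. The com.me package and the vertica-dialect.jar name are placeholders that must match your VerticaDialect.scala, and this assumes the interactive pyspark shell where spark is predefined:

# Compile against the Spark jars, writing the bytecode into a jar, e.g.:
#   scalac -classpath "$SPARK_HOME/jars/*" -d vertica-dialect.jar VerticaDialect.scala
# Then launch PySpark with the compiled jar (not the .scala source):
#   pyspark2 --jars /home/x/vertica-jdbc-9.2.0-0.jar,/home/x/vertica-dialect.jar

from py4j.java_gateway import java_import

gw = spark.sparkContext._gateway
# java_import only makes the name visible to py4j; the compiled class must
# actually be on the JVM classpath. Otherwise com.me.VerticaDialect resolves
# to an unbound JavaPackage, and calling it raises the
# "'JavaPackage' object is not callable" error seen above.
java_import(gw.jvm, "com.me.VerticaDialect")
gw.jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(
    gw.jvm.com.me.VerticaDialect())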
Hi Bryan,
It will be very beneficial for clients who use PySpark to push data into Vertica to have this set up properly. If you can guide us step by step through my code, then others can also leverage the same.
Is there a way the dialect can be written directly for PySpark? (just asking)
Also, what would be the best way to integrate this with Spark, rather than registering the dialect every time?
Prakhar
@Prakhar84, quick tangent - I'm curious, what became of checking the connectivity between Vertica and HDFS? Most of my customers just get the Spark Connector working, which usually solves most issues. I believe you raised another discussion here. Let's continue that discussion there.
@Bryan_H
Please let us know of any solution for this dialect in PySpark.
Thanks for your guidance so far.
Prakhar
Hi, a workaround we have suggested for other customers is to write Parquet files to a temporary folder that Vertica can read, then use the vertica-python driver to issue a COPY command to import the Parquet files (see http://github.com/vertica/vertica-python for details).
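A minimal sketch of that workaround, assuming the pyspark shell, a folder the Vertica node can read (/shared/tmp is a placeholder), and hypothetical connection details:

import vertica_python

# Write the DataFrame as Parquet to a location the Vertica node can read
# (a shared filesystem mount in this sketch).
df.write.mode("overwrite").parquet("/shared/tmp/my_table_parquet")

conn_info = {"host": "vertica-host", "port": 5433, "user": "dbadmin",
             "password": "***", "database": "mydb"}
with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Bulk-load every Parquet file written by Spark in one COPY statement.
    cur.execute("COPY my_schema.my_table "
                "FROM '/shared/tmp/my_table_parquet/*.parquet' PARQUET")
    conn.commit()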
We are investigating better solutions; however, in my own opinion, we should implement a complete VerticaDialect and commit it to Apache Spark to fix this for all Spark programs, whether Java, Scala, or PySpark. It will take a long time to develop, though, and then a long time for Apache to accept and publish it as part of the next Spark release.
I'd also recommend checking with the Spark community on how to use JDBC dialects in PySpark.
Hi, you can add compiled classes to the classpath. However, you would need to rebuild Spark from a GitHub checkout after applying the following patch:
https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
This would embed the VerticaDialect into the Spark build. To avoid having to reinstall the custom Spark build, you can extract the compiled VerticaDialect class and add it to the classpath at runtime.
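For example, assuming the compiled class files were extracted to /home/x/dialect-classes (a placeholder path), registration from the pyspark shell might look like this:

# Launch with the extracted classes on the driver and executor classpath:
#   pyspark2 --conf spark.driver.extraClassPath=/home/x/dialect-classes \
#            --conf spark.executor.extraClassPath=/home/x/dialect-classes

jvm = spark.sparkContext._jvm
# The patched dialect lives in Spark's own org.apache.spark.sql.jdbc package.
# If it compiles as a Scala object, fetch the MODULE$ singleton; if it is a
# plain class, call its constructor as in the earlier snippet instead.
dialect = getattr(jvm.org.apache.spark.sql.jdbc, "VerticaDialect$").MODULE$
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(dialect)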