VERTICA - SPARK CONNECTOR ISSUE

Ben_Fishman · March 2016

Hey,

My name is Ben.

I'm working with:

Vertica version: 7.2.1

Spark connector version: 0.2.1

I'm trying to run a SQL query on a DataFrame, something like select * from ... where ... IN ("A","B","V")

The problem is that Vertica parses the query the wrong way.

The ("A","B","V") part is translated as a list object.

From the Vertica log:

2016-03-17 10:34:24.256 Init Session:0x7eff38012d50 [Session] <INFO> [PQuery] TX:0(...-24603:0xcc8) select …,...,...,... from ... where( (0x00000000ffffffff & hash()) >= 3579139414 and (0x00000000ffffffff & hash()) <= 4294967297 ) AND (... in [Ljava.lang.Object; @5aca2402)

2016-03-17 10:34:24.256 Init Session:0x7eff38012940 <ERROR> @v_..._node0001: 42601/4856: Syntax error at or near "[" at character 205

Additionally,

I'm trying to run a query like select * from ... where col = "abc"

Vertica parses the filter value "abc" as a column.

How can I filter on a string in a Spark SQL / Vertica DataFrame?

Both problems I've presented here look like a bug in the Vertica API.

Please help!

Thanks,

Ben

EdwardM · March 2016

Hi Ben,

Could you describe in more detail (maybe with a code example) as to how you're performing this string-filter operation from Spark?

Thanks,

Ed

Ben_Fishman · March 2016

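import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

// Function that creates a DataFrame from a Vertica data source.
// verticaSparkClass (defined elsewhere) is the data source name passed to format().
def queryRdd(tableSchema: String, table: String)(implicit vConn: Map[String, String], sc: SparkContext): DataFrame = {
  val sqlContext = new SQLContext(sc)

  // Add the table name to a copy of the connection options
  val opts = vConn + ("table" -> table)
  // + ("dbschema" -> tableSchema)

  sqlContext.read.format(verticaSparkClass).options(opts).load()
}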

// Create the DataFrame
val resultRdd = dalVertica.queryRdd(public,tableTest)

// Query the DataFrame
val filterdRdd = resultRdd.select("colA","colB").filter("colA ='test'")

Error, as explained above. From the Vertica log:

column test doesnt exist

OR

val filterdRdd = resultRdd.select("colA","colB").filter("colA in ('test1','test2')")

From the Vertica log:

2016-03-17 10:34:24.256 Init Session:0x7eff38012d50 [Session] <INFO> [PQuery] TX:0(...-24603:0xcc8) select …,...,...,... from ... where( (0x00000000ffffffff & hash()) >= 3579139414 and (0x00000000ffffffff & hash()) <= 4294967297 ) AND (... in [Ljava.lang.Object; @5aca2402)

2016-03-17 10:34:24.256 Init Session:0x7eff38012940 <ERROR> @v_..._node0001: 42601/4856: Syntax error at or near "[" at character 205
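
For readers following the log: the "[Ljava.lang.Object;" text is what a JVM object array prints by default when it is turned into a string, which suggests the IN values reached the generated SQL without being rendered as a list. The "column test doesn't exist" error is likewise consistent with the string value arriving unquoted. The sketch below only illustrates that effect in plain Scala; it is not the connector's code, and InListRendering, sqlLiteral, naive and rendered are hypothetical names.

```scala
// Hypothetical illustration, not the Vertica connector's implementation.
object InListRendering {
  // Render a value as a SQL literal: strings are single-quoted, with
  // embedded single quotes doubled; other values use their toString.
  def sqlLiteral(v: Any): String = v match {
    case null      => "NULL"
    case s: String => "'" + s.replace("'", "''") + "'"
    case other     => other.toString
  }

  // Interpolating the array directly uses the JVM default toString,
  // which yields "[Ljava.lang.Object;@<hash>" instead of a list.
  def naive(column: String, values: Array[Any]): String =
    s"$column in $values"

  // Rendering each element and joining them produces a valid IN list.
  def rendered(column: String, values: Array[Any]): String =
    s"$column in ${values.map(sqlLiteral).mkString("(", ", ", ")")}"

  def main(args: Array[String]): Unit = {
    val values: Array[Any] = Array("A", "B", "V")
    println(naive("col", values))    // col in [Ljava.lang.Object;@<hash varies>
    println(rendered("col", values)) // col in ('A', 'B', 'V')
  }
}
```

The same sqlLiteral step is what keeps a value such as 'test' from being read as a column name.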

EdwardM · March 2016

Hi Ben,

Is that "in" clause the canonical way of doing that in Spark? I was able to phrase your example as a multi-condition filter as follows:

df.where("a = 'X' OR a = 'Y'")

which should produce the desired result. However, you may encounter a bug in the connector related to strings. That issue will be fixed in the next release, which should be available next week. Thanks for your patience!

Ed
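
When the values come from code rather than being typed by hand, the multi-condition filter above can be generated. This is a minimal sketch in plain Scala, assuming string values only; OrFilter, quote and orCondition are hypothetical names, not part of Spark or the connector.

```scala
// Hypothetical helper for building the OR-style filter string.
object OrFilter {
  // Single-quote a string value, doubling any embedded single quotes.
  def quote(s: String): String = "'" + s.replace("'", "''") + "'"

  // Builds a condition such as: a = 'X' OR a = 'Y'
  def orCondition(column: String, values: Seq[String]): String = {
    require(values.nonEmpty, "at least one value is needed")
    values.map(v => s"$column = ${quote(v)}").mkString(" OR ")
  }
}
```

The resulting string could then be passed to df.where, for example df.where(OrFilter.orCondition("a", Seq("X", "Y"))).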

Ben_Fishman · March 2016

Hey Edward,
Yes, the "in" clause is the canonical way of doing that in Spark, and it's important for me too.
I know that I can use an "or" clause, but that's not good enough for me:
I have to filter on something like 100,000 - 500,000 values in each query.
As I mentioned before, there is a bug: when you issue a query with an "in" clause, Vertica turns it into a list object. When will this bug be fixed too?

Another question: is there any limit on query length? The queries are generated from Scala code and can get big.
Thanks,
Ben
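
Whether Vertica or the connector limits query length is not answered in this thread. One way to keep each generated filter bounded regardless is to split the values into batches. The following is a sketch in plain Scala under that assumption; BatchedFilters, quote and orConditions are hypothetical names, and the batch size is an arbitrary choice for the caller.

```scala
// Hypothetical helper: one OR-style condition per batch of values.
object BatchedFilters {
  // Single-quote a string value, doubling any embedded single quotes.
  def quote(s: String): String = "'" + s.replace("'", "''") + "'"

  // Splits values into groups of at most batchSize and renders each
  // group as "col = 'v1' OR col = 'v2' ...".
  def orConditions(column: String, values: Seq[String], batchSize: Int): List[String] = {
    require(batchSize > 0, "batchSize must be positive")
    values.grouped(batchSize).map { batch =>
      batch.map(v => s"$column = ${quote(v)}").mkString(" OR ")
    }.toList
  }
}
```

Each condition could be passed to the DataFrame's where method one at a time and the results combined, for example with unionAll in the Spark 1.x DataFrame API. Whether that performs acceptably for several hundred thousand values would need to be measured.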
