VERTICA - SPARK CONNECTOR ISSUE

Hey,

My name is Ben. I'm working with:

Vertica version: 7.2.1
Spark connector version: 0.2.1
 
I'm trying to execute a SQL query on a DataFrame with a filter like select * from ... where col IN ('A','B','V').
The problem is that Vertica receives the query parsed the wrong way:
the ("A","B","V") value list is translated into a Java array, so the SQL sent to Vertica contains the array's toString() (that's the [Ljava.lang.Object; in the log below) instead of an expanded list of values.
 
from vertica log:
 
2016-03-17 10:34:24.256 Init Session:0x7eff38012d50 [Session] <INFO> [PQuery] TX:0(...-24603:0xcc8) select …,...,...,... from ... where( (0x00000000ffffffff & hash()) >= 3579139414 and (0x00000000ffffffff & hash()) <= 4294967297 ) AND (... in [Ljava.lang.Object; @5aca2402)
2016-03-17 10:34:24.256 Init Session:0x7eff38012940 <ERROR> @v_..._node0001: 42601/4856: Syntax error at or near "[" at character 205
 
 
Additionally, I'm trying to execute a query like select * from ... where col = "abc",
and Vertica parses the filter value "abc" as a column name.
How can I filter on a string value in a Spark SQL Vertica DataFrame?
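For reference, the two usual ways to spell this filter in Spark — a sketch, assuming a DataFrame df with a string column colA; the Column API form makes the string an explicit literal on the Spark side, so it may be worth trying against the mis-parse:

import org.apache.spark.sql.functions.{col, lit}

// string-expression form -- the one that hits the parsing problem above:
val byExpr = df.filter("colA = 'abc'")

// Column API form -- the string is an explicit literal, never ambiguous
// with a column name in the Spark expression:
val byColumn = df.filter(col("colA") === lit("abc"))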
 
Each problem I've presented here seems like a bug in the Vertica API.
 
Please help!

Thanks,
Ben

Comments

  • Hi Ben,

     

    Could you describe in more detail (maybe with a code example) how you're performing this string-filter operation from Spark?

     

    Thanks,

    Ed

  • // Function that creates a DataFrame from the Vertica data source
    def queryRdd(tableSchema: String, table: String)(implicit vConn: scala.collection.mutable.Map[String, String], sc: SparkContext): DataFrame = {
      val sqlContext = new SQLContext(sc)

      // add the target table to the connector options
      vConn += ("table" -> table)
      // vConn += ("dbschema" -> tableSchema)

      sqlContext.read.format(verticaSparkClass).options(vConn).load()
    }
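    For context, vConn here is the connector options map. A minimal sketch of how it might be built — the "table" and "dbschema" keys appear in the code above, while the remaining key names and all values below are placeholders, not something taken from this thread:

    import scala.collection.mutable

    implicit val vConn: mutable.Map[String, String] = mutable.Map(
      "host"     -> "vertica-host",  // placeholder host
      "db"       -> "mydb",          // placeholder database name
      "user"     -> "dbadmin",       // placeholder credentials
      "password" -> "secret"
    )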

     

    // create the DataFrame
    val resultRdd = dalVertica.queryRdd("public", tableTest)

     

    // query the DataFrame
    val filterdRdd = resultRdd.select("colA", "colB").filter("colA = 'test'")

    ERROR as explained above — from the Vertica log:

    column "test" doesn't exist

     

    OR

     

    val filterdRdd = resultRdd.select("colA","colB").filter("colA in ('test1','test2')")

     

     
    from vertica log:
     
    2016-03-17 10:34:24.256 Init Session:0x7eff38012d50 [Session] <INFO> [PQuery] TX:0(...-24603:0xcc8) select …,...,...,... from ... where( (0x00000000ffffffff & hash()) >= 3579139414 and (0x00000000ffffffff & hash()) <= 4294967297 ) AND (... in [Ljava.lang.Object; @5aca2402)
    2016-03-17 10:34:24.256 Init Session:0x7eff38012940 <ERROR> @v_..._node0001: 42601/4856: Syntax error at or near "[" at character 205 

     

  • Hi Ben,

     

    Is that "in" clause the canonical way of doing that in Spark? I was able to phrase your example as a multi-condition filter as follows:

     

    df.where("a = 'X' OR a = 'Y'")

     

    which should produce the desired result. However, you may encounter a bug in the connector related to strings. That issue will be fixed in the next release, which should be available next week. Thanks for your patience!
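    For a longer list of values, that OR chain can be built programmatically — a sketch, assuming only the standard Spark Column API (col and the || operator on Column):

    import org.apache.spark.sql.functions.col

    // one equality condition per value, OR-ed together:
    val values = Seq("X", "Y")
    val cond = values.map(v => col("a") === v).reduce(_ || _)
    df.where(cond)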

     

    Ed

  • Hey Edward,
    Yes, that "in" clause is the canonical way of doing it in Spark, and it's important for me too.
    I know I can use an "or" clause, but that isn't good enough for me: I have to filter on 100,000 - 500,000 values in each query.
    There is also the bug I mentioned before, where a query with an "in" clause has its value list turned into a Java array object by the connector — when will that bug be fixed too?

    Another question: is there any limitation on query length? The queries are generated from Scala code and can get long.
    Thanks,
    Ben
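    For IN lists that large, one possible workaround (a sketch, not taken from the thread) is to skip the filter pushdown entirely and join the Vertica DataFrame against a small DataFrame of keys; here keys is a hypothetical Seq[String] of filter values, and resultRdd is the DataFrame from the earlier example:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.broadcast

    // one row per filter value:
    val keysDf = keys.map(Tuple1.apply).toDF("colA")
    // broadcast the small key set so the inner join avoids shuffling the big side:
    val filtered = resultRdd.join(broadcast(keysDf), "colA")

    This pulls unfiltered rows out of Vertica and filters on the Spark side, so it trades pushdown efficiency for correctness until the connector fix lands.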
