

VERTICA - SPARK CONNECTOR ISSUE — Vertica Forum


Hey,

My name is Ben.

I'm working with:

Vertica version: 7.2.1
spark-connector version: 0.2.1

I'm trying to execute a SQL query on a DataFrame with a filter like select * from ... where ... IN ('A','B','V').
The problem is that Vertica parses the query in the wrong way:
the ('A','B','V') part is passed through as a Java list object.

From the Vertica log:

2016-03-17 10:34:24.256 Init Session:0x7eff38012d50 [Session] <INFO> [PQuery] TX:0(...-24603:0xcc8) select …,...,...,... from ... where( (0x00000000ffffffff & hash()) >= 3579139414 and (0x00000000ffffffff & hash()) <= 4294967297 ) AND (... in [Ljava.lang.Object; @5aca2402)
2016-03-17 10:34:24.256 Init Session:0x7eff38012940 <ERROR> @v_..._node0001: 42601/4856: Syntax error at or near "[" at character 205

Additionally, when I try to execute a query like select * from ... where col = "abc",
Vertica parses the filter value "abc" as a column name.
How can I filter on a string in a Spark SQL / Vertica DataFrame?

Each of the problems I presented here looks like a bug in the Vertica API.

Please help!

Thanks,
Ben

Comments

  • Hi Ben,

     

    Could you describe in more detail (maybe with a code example) how you're performing this string-filter operation from Spark?

     

    Thanks,

    Ed

  • // function that creates a DataFrame from the Vertica data source
    def queryRdd(tableSchema: String, table: String)(implicit vConn: Map[String, String], sc: SparkContext): DataFrame = {

      val sqlContext = new SQLContext(sc)

      // vConn is an immutable Map, so build a new options map with the
      // "table" entry added rather than trying to mutate vConn in place
      val opts = vConn + ("table" -> table)
      //              + ("dbschema" -> tableSchema)

      sqlContext.read.format(verticaSparkClass).options(opts).load()
    }

     

    // create DataFrame
    val resultRdd = dalVertica.queryRdd("public", tableTest)

     

    // query the DataFrame
    val filterdRdd = resultRdd.select("colA", "colB").filter("colA = 'test'")

    ERROR as explained above, from the Vertica log:

    column test does not exist

     

    OR

     

    val filterdRdd = resultRdd.select("colA", "colB").filter("colA in ('test1','test2')")

     

     
    from vertica log:
     
    2016-03-17 10:34:24.256 Init Session:0x7eff38012d50 [Session] <INFO> [PQuery] TX:0(...-24603:0xcc8) select …,...,...,... from ... where( (0x00000000ffffffff & hash()) >= 3579139414 and (0x00000000ffffffff & hash()) <= 4294967297 ) AND (... in [Ljava.lang.Object; @5aca2402)
    2016-03-17 10:34:24.256 Init Session:0x7eff38012940 <ERROR> @v_..._node0001: 42601/4856: Syntax error at or near "[" at character 205

     

  • Hi Ben,

     

    Is that "in" clause the canonical way of doing that in Spark? I was able to phrase your example as a multi-condition filter as follows:

     

    df.where("a = 'X' OR a = 'Y'")

     

    which should produce the desired result. However, you may encounter a bug in the connector related to strings. That issue will be fixed in the next release, which should be available next week. Thanks for your patience!

     

    Ed
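
Ed's OR-based rewrite can also be generated mechanically instead of typed by hand. A minimal sketch in plain Scala (no Spark dependency; `orFilter` is a hypothetical helper name, not part of the connector) that turns a column name and a list of values into the equivalent OR-chained filter expression:

```scala
// Hypothetical helper: build an OR-chained filter expression equivalent to
// "col IN (v1, v2, ...)", as a workaround for the connector passing IN lists
// through as a Java array. Single quotes in values are escaped by doubling
// them, per standard SQL string-literal rules.
def orFilter(col: String, values: Seq[String]): String =
  values
    .map(v => s"$col = '${v.replace("'", "''")}'")
    .mkString(" OR ")
```

With this, something like `df.filter(orFilter("colA", Seq("test1", "test2")))` would apply the filter `colA = 'test1' OR colA = 'test2'`.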

  • Hey Edward,
    Yes, that "in" clause is the canonical way of doing it in Spark, and it's important for me too.
    I know that I can use an "or" clause, but that's not good for me:
    I have to filter on 100,000 - 500,000 values in each query.
    There is also the bug I mentioned before, where Vertica turns a query with an "in" clause into a list object. When will that bug be fixed too?

    Another question: is there any limitation on query length? The queries are generated from Scala code and can get quite long.
    Thanks,
    Ben
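
With hundreds of thousands of values, a single generated filter string may also run into whatever query-length limits apply, so one possible mitigation is to split the value list into fixed-size batches and run one query per batch, unioning the results afterwards. A sketch in plain Scala (`chunkedFilters` is a hypothetical name; the batch size is an assumption to be tuned):

```scala
// Hypothetical helper: split a large value list into batches and build one
// OR-chained filter expression per batch, so no single query string grows
// without bound. Each expression could drive a separate DataFrame filter,
// with the per-batch results unioned afterwards.
def chunkedFilters(col: String, values: Seq[String], batchSize: Int): Seq[String] =
  values
    .grouped(batchSize)
    .map(batch => batch.map(v => s"$col = '${v.replace("'", "''")}'").mkString(" OR "))
    .toList
```

This keeps each query bounded, at the cost of running several queries per logical filter.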
