How to receive data from DB tables using R-UDF

Using R-UDF to receive data from some DB tables, but R-UDF factory need to define intype and outype ,that means i need to define every column type in the factory ,but the function is used for all sorts of tables, the count of columns is unknown, it 's not only for one particular table, how to implement this function using R-UDF , any idea?


  • Hi!

    >> Using R-UDF to receive data from some DB tables

    >> how to implement this function using R-UDF , any idea?
    Like function overloading. Take a look on Polymorphic functions:
                intype=c("any"),             outtype=c("any"), 
    and since Vertica data type are  limited set of types, its little annoying but possible to define a polymorphic function, that can process any data type.

  • For Daniel's R-ODBC link:  I wouldn't recommend using R-ODBC in this scenario unless you've looked at a bunch of alternatives and have determined that it's the only way.  You would usually use either R-ODBC or R-UDx; not both.  (Though either should currently work.)

    On that note, thanks Daniel to the link about ploymorphic types.  Haifeng, let us know if that solves your problem.
  • Scalar/Polymorphic functions, projections - too much academy :)
    Is it correct:
    Table -> vector
    Schema -> dimension
    Database -> vector space
  • Thanks very much ,Daniel, the link you mentioned is for vertica 7.0, is this work for vertica 6.1.3 ?  since i am using vertica 6.1.3 .
  • Thanks Adam, if i got this solved , i will update you guys.  Actually i have another question for R-UDF , if i want to access DB inside UDF, is this true that i need to open session ,connect to vertica with R-ODBC driver ?  The UDF is called in select clauses, so the UDF will open lots of session to access DB , right ? that should be bad performance , right ?   Do you have best practice for this (inside UDF, need to get some values through SQL query)?
  • So, the R-UDF model is not that R-UDF's get data from Vertica; it is that Vertica gives data to the R-UDF.  So you should make sure to pass in whatever you need as an argument to the UDF when you call it.

    Usually this is sufficient.  (Sometimes you have to think about it a little, but from a performance perspective it's usually worth the thought.)

    We don't currently have a best practice for the cases where it's not sufficient.  As you note, it's not something that's going to perform well; there are various disadvantages, etc.  I know it's something that is sometimes necessary; we don't have a good best practice, and that is a limitation.  On the other hand, it does encourage many people to figure out how to pass the data that they need into the UDF as an argument; in many cases this is not intuitive to the developer ("I need data?  Write a SQL query!") but it yields a much-faster and more robust solution.
  • Sorry, for late answer :( (GetSatisfaction im not satisfied!!!)

    Looks like Vertica team creates a documentation system and so far moved all docs to one location:

    take a look on left menu:

  • I have some practice: it very depends on partitioning, how it can be parallelized.
    Due R limitations most of (if not all) parallelization relays on Vertica

    (and I have to say that Vertica solves it very well, for example Vertica created 16 threads for partitioned anova function on each [32 CPU node] and [1000 - 2000+ partitions])

  • thanks a lot , you should means there is no difference between vertica 6.1 and 7.0 on the R perspective , right ?
  • yes, agree with you , actually , i want to use R-UDF just like store procedure , since vertica does not have SP, so i'd like to use R-UDF to implement some logic which can not be done by SQL , you know , i don't want to use JAVA to implement that ,that's too heavy for some little features.
  • Not, exactly.  Vertica blog covers this topic much more better than documentation (IMO), but don't use in examples of blog due wordpress formatting.

    Vertica team developed some package that allows you to run UDF-x R on Vertica and also team did a big approach - they succeed to parallelize R via partitions (each partition was processed in it own thread/subprocess).
    But...  but it didn't solved a problem for complex algorithms where full scan required like mean, or where recursion is required (and it can't be solved with "current approach" due R limitations and not Vertica). Vertica did an amazing thing, its not so easy in distributed systems, especially with inter-processing interactions - hello live/dead locks, memory allocation and segmentation fault. Its not stops here on algorithms only. You need to care for HA, so hello linux kernel(CPU modes switching? Or Vertica does not care about it?I don't know, but I think that cares) and so far.

    So Vertica did a next step and now we can get all benefits with Distributed R much more suitable for large clusters.

    So Vertica does do evolution/revolution ;)


    I always was wonder why it called zygote? Now I think I understand :)
    but who called it zygote should be familiar with genetics

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file