How to receive data from DB tables using R-UDF
I am using an R-UDF to receive data from some DB tables, but the R-UDF factory needs to define intype and outtype, which means I have to declare every column type in the factory. The function, however, is meant to work against all sorts of tables, so the number of columns is unknown; it is not tied to one particular table. How can I implement this with an R-UDF? Any ideas?
Comments
>> Using R-UDF to receive data from some DB tables
https://community.vertica.com/vertica/topics/can_i_open_odbc_connection_from_within_vertica_using_r_...
>> how to implement this function using R-UDF , any idea?
It's like function overloading. Take a look at polymorphic functions:
https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/ProgrammersGuide/UserDefinedFunctions/UDx... Since Vertica's data types are a limited set of types, it is a little tedious, but possible, to define a polymorphic function that can process any data type.
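As a minimal sketch of what such a polymorphic R-UDF factory might look like (assuming Vertica 7.x's R UDx API; the function name countColumns and its body are hypothetical), the key point is intype = c("any"), which lets one factory accept any number and type of input columns:

```r
# Hypothetical polymorphic transform function: the factory declares
# intype "any", so the same UDx can be called on any table's columns.
countColumns <- function(input.data.frame) {
  # Vertica passes the input rows as a data frame, one column per
  # argument the caller supplied; inspect it at run time.
  data.frame(n_cols = ncol(input.data.frame))
}

countColumnsFactory <- function() {
  list(name    = countColumns,
       udxtype = c("transform"),
       intype  = c("any"),   # polymorphic: accept any column types
       outtype = c("int"))
}
```

After CREATE LIBRARY / CREATE TRANSFORM FUNCTION, it could then be invoked as, say, `SELECT countColumns(a, b, c) OVER () FROM t;` regardless of the columns' types or how many there are.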
On that note, thanks Daniel for the link about polymorphic types. Haifeng, let us know if that solves your problem.
Is it correct:
Table -> vector
Schema -> dimension
Database -> vector space
?
Usually this is sufficient. (Sometimes you have to think about it a little, but from a performance perspective it's usually worth the thought.)
We don't currently have a best practice for the cases where it's not sufficient. As you note, it's not something that's going to perform well, and there are various other disadvantages. I know it's sometimes necessary; we don't have a good best practice there, and that is a limitation. On the other hand, it does encourage many people to figure out how to pass the data that they need into the UDF as an argument. In many cases this is not intuitive to the developer ("I need data? Write a SQL query!"), but it yields a much faster and more robust solution, as in the sketch below.
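To illustrate the pass-it-as-an-argument pattern, here is a hypothetical sketch (the names scaleByRate, value, and rate are invented): instead of opening a connection inside the R function, the caller joins the lookup data in SQL and hands it to the UDx as an extra input column.

```r
# Hypothetical transform UDx: the "rate" column arrives as a normal
# input argument, supplied by the SQL query, e.g.
#   SELECT scaleByRate(t.value, r.rate) OVER () FROM t JOIN r ON ...;
# so the R code never needs its own database connection.
scaleByRate <- function(input.data.frame) {
  data.frame(scaled = input.data.frame[, 1] * input.data.frame[, 2])
}

scaleByRateFactory <- function() {
  list(name    = scaleByRate,
       udxtype = c("transform"),
       intype  = c("float", "float"),
       outtype = c("float"))
}
```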
It looks like the Vertica team has created a documentation system and has so far moved all the docs to one location: http://www.vertica.com/documentation/hp-vertica-analytics-platform-7-0-x-product-documentation/
Take a look at the left menu.
Due to R limitations, most (if not all) of the parallelization relies on Vertica
(and I have to say that Vertica solves it very well: for example, Vertica created 16 threads for a partitioned ANOVA function on each 32-CPU node, with 1000-2000+ partitions; see the sketch after the links below).
http://www.qualitytrainingportal.com/resources/msa/rr_using_anova.htm
http://www.vertica.com/2013/02/21/presto-distributed-r-for-big-data/
http://www.vertica.com/2012/12/21/a-deeper-dive-on-vertica-r/
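As a rough sketch of why partitioning parallelizes so naturally (the names here are invented, not the actual ANOVA function mentioned above): a transform UDx only ever sees one partition's rows at a time, so Vertica is free to push many partitions through separate R side processes at once.

```r
# Hypothetical per-partition aggregate: each call receives the rows of
# a single partition, so a query like
#   SELECT partitionMean(y) OVER (PARTITION BY grp) FROM t;
# lets Vertica process the 1000-2000+ partitions concurrently.
partitionMean <- function(input.data.frame) {
  data.frame(mean_val = mean(input.data.frame[, 1]))
}

partitionMeanFactory <- function() {
  list(name    = partitionMean,
       udxtype = c("transform"),
       intype  = c("float"),
       outtype = c("float"))
}
```

This is also why a global statistic over the whole table (the "full scan" case below) does not fit the model: no single partition ever sees all the rows.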
The Vertica team developed a package that allows you to run R UDxs on Vertica, and the team also made a big advance: they managed to parallelize R via partitions (each partition is processed in its own thread/subprocess).
But... it didn't solve the problem for algorithms where a full scan is required, like a global mean, or where recursion is required (and that can't be solved with the current approach, due to R's limitations rather than Vertica's). Vertica did an amazing thing; it's not so easy in distributed systems, especially with inter-process interactions: hello live/dead locks, memory allocation, and segmentation faults. And it doesn't stop at the algorithms. You need to care about HA, so hello Linux kernel (CPU mode switching? Or does Vertica not have to care about that? I don't know, but I think it does), and so on.
So Vertica took the next step, and now we can get all the benefits with Distributed R, which is much more suitable for large clusters.
https://community.vertica.com/vertica/topics/hp_vertica_innovations_tell_us_what_you_think
So Vertica does do evolution/revolution
[PS]
I always wondered why it is called zygote; now I think I understand.
But whoever named it zygote must be familiar with genetics.