Returning a list in R UDF
I am working on a R UDF that requires returning a list of two data frames: I have given an example of what I am looking for below:
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000), x4 = runif(1000)) returnList <- function(dataFrame) { x1 <- dataFrame[,1] x2 <- dataFrame[,2] x3 <- dataFrame[,3] x4 <- dataFrame[,4] dataFrame1 = data.frame(newX1 = x1[1:200],newX2 = x2[1:200]) dataFrame2 = data.frame(newX3 = x3[1:500],newX4 = x4[1:500]) return(list(dataFrame1,dataFrame2)) } returned <- returnList(df)Is there a way to accomplish this using Vertica R UDF i.e. return a list instead of a dataframe? Thank you. Ravi
0
Comments
Or if it is possible to combine the data frames in a single one do so. And separate them manually on another module.
Yes, its possible, use in transform function (transform function can be used as any function type - scalar, aggregative, analytic)
If you are looking to return list as columns so its also possible, but you need to define all columns in OUTTYPE.
FYI: table can has max 1600 columns.
Simple example: Usage:
In your answer you're returning data.frames not R lists. A list can contain R objects, other lists or data frames without constraints.
So what the difference between list is "column of values" and list is "row of values"?
Do you need a row or values?
In R you can return
list(obj1 = 1,
obj2 = data.frame,
obj3 = list(obj31 = single value,
objj32 = matrix)
obj4 = R's linear regression object
)
UDFs return a dataframe only because in the end they have to go through vertica.
A sequence of values, as you write it is not neccesarily a list in R, but a simple vector. A group of vector (of the same dimensions) can be coerced to a matrix or a data frame (With certain restrictions due to object type). A list has basically no restrictions.
Hope that helped
No. It doesn't helps at all. It you do not understands Vertica limitations.
1. Do you understands that RDBMS its strong typed and statically typed "query language"?
2. Also Vertica has limitation of 1600 columns per table, while list can be much more larger.
3. Also each field is basic built in Vertica data type, while object of list can be complicated, Vertica has no arrays like Postgres.
If you understands all this limitation so why you ask this question?
It is generally useful if you are able to transform your data into a form that Vertica understands. Vertica provides a pretty flexible set of semantics for dealing with data in all different forms. You can return multiple columns; you can effectively omit some columns in some cases (Vertica is very efficient with long strings of NULLs); etc.
One approach, with its own pro's an con's, to the original question might be to have output columns for each data frame, plus an additional output column saying which frame the current row corresponds to. Then write out frame 1 (and leave frame 2's columns as null); then vice versa. There are various other similar workarounds. Not ideal, granted; but it gives you the ability to write the data out in a form that's understood (and can be manipulated) by Vertica.
If you have an R structure that fundamentally doesn't map to something that can be represented in SQL, the only real workaround is to create a VARBINARY (or, in Vertica 7, LONG VARBINARY) column and use an R serialization library to turn your data structure into a byte-string that can be saved into the VARBINARY column. You can then read it out and deserialize it later with other R code, and get the original object back. This tends to be quite slow; it's also limited in the size of the objects you can write out per row. And you can't manipulate the resulting structure in SQL; not generally in interesting ways at least. But it lets you deal with arbitrary R types.
A possible alternative to all of this is to use our newly-released distributedR functionality. It hides a lot of this sort of stuff, and makes it possible to write distributed programs using various R data structures all in native R.
Adam
Could you please provide some information on the distributedR functionality? Is it available now in Vertica 7 (community edition)? Is it included in the R language pack? Is there any documentation that shows how to use the distributed feature in R? I don't see any mention of it on Vertica's website. Thank you.
On the original question that I asked, I found out a work around and worked with data frames instead Thanks for your inputs.
Sure -- take a look at:
https://vertica.hpwsportal.com/#/Product/{%22productId%22:%22420%22}/Show
DistributedR is a free Vertica add-on. I believe that it will work with the Vertica 7 CE. It is a separate download/installation. There's a YouTube video at that URL that describes some basic usage. If you download the package, it contains a README and an installation guide.
The URL above also contains a link to a forum topic about the feature. Feel free to post any questions or comments there. It's a brand-new feature; we're definitely interested in feedback.
Adam
Depending on your requirements you could use a JSON string as output in a Varchar Column
I've been using either jsonlite or RJSON for object conversion (back and forth).
For this to work you need to do the following
1) Convert R object to JSON with toJSON()
2) Convert JSON Object to character with as.character()
3) Output as a Character Column in a data.frame
Wherever the output is used you can either reconvert into an R object or parse it as JSON.
Hope this helps you (at least in the future)