Undefined Number of Columns in R Polymorphic Function

vioravis · October 2014

Vertica 7.0 Programmer's Guide has the following example regarding Polymorphic functions:

RFactory <- function() {
list(name=RFunction, udxtype=c("transform"), intype=c("any"),
outtype=c("any"), outtypecallback=ReturnType)
}

Does outtype = "any" mean the same R UDF can return different number of columns depending on the inputs provided? I have tried using it but it fails everytime.

Some clarification on this would be helpful.

Thank you.

Ravi

Chris_C · November 2014

Hi Ravi,

Outtype("any") does indeed mean that the R UDF can return a variable number of columns depending on your function.

You must provide an outtypecallback function. The outtypecallback function must have the same parameters passed to it as your main function (x, y where X = input column(s) from Vertica and Y is any parameters). For example:

# Determine the return types based on the input types and sizes
polyTopKReturnType <- function(x,y)
{
     ret <- NULL
     for( i in 2:nrow(x))
     {
         rbind(ret,x[i,]) -> ret
     }
     ret
}

vioravis · November 2014

Thank you Chris. I would look into this example.

Andrew_Gibson · January 2015

I know this is an older post but it exactly relates to my issue. I just started looking at Vertica and R and I'm struggling a little with the code you provided here. I suppose I was thinking that x would be a dataframe, 1 column for each one sent through from Vertica, but based on the loop you' have here, this looks more like x has 1 row for every column sent from Vertica. I also believe (from the manual) that the returned object from this function is supposed to be a 4 column matrix with 1 row for each field returned defining the data-types. Is that correct? If so, now I'm thnking perhaps x is just a list of the Vertica data types for the fields sent to R ? If I'm right would it not also be correct (and perhaps simpler) to write .. polyTopKReturnType <- function(x,y) { # return all field definitions except the first one x[2:nrow(x),] }

Andrew_Gibson · January 2015

OK - after some debugging that involved saving objects to file within the code. Yes, x, is a set of field definitions. It actually is fed into the function as a data.frame 4 columns and 1 row for each input variable. like this datatype length scale name 1 float 0 0 p1 2 float 0 0 p2 I want to add another row in the return fields for my Cluster identifier (an integer, but there is a small problem; datatype and name are both factors not character vectors. > str(x) 'data.frame': 2 obs. of 4 variables: $ datatype: Factor w/ 1 level "float": 1 1 $ length : num 0 0 $ scale : num 0 0 $ name : Factor w/ 2 levels "p1","p2": 1 2 When I try to rbind an additional row it does not fail but neither does it add the data I want > rbind(x,c('int',NA,NA,'cluster')) datatype length scale name 1 float 0 0 p1 2 float 0 0 p2 3 Warning messages: 1: In '[<-.factor'('*tmp*', ri, value = "int") : invalid factor level, NA generated 2: In '[<-.factor'('*tmp*', ri, value = "cluster") : invalid factor level, NA generated I can work around it, but it took some effort to even find this was a problem as I can;t find a spec for what x is. Documentation (and examples) really could be better. Sadly this has not solved my problem just changed the error message. I'll open up a new post.

Chris_C · January 2015

Hi Andrew.

You cannot add new factors to the data.frame that way.

Change those factors to char then you can use the rbind:

> x$datatype <- as.character(x$datatype)

> x$name <- as.character(x$name)
 > str(x) 'data.frame': 2 obs. of  4 variables:  $ datatype: chr  " float" " float"  $ length  : int  0 0  $ scale   : int  0 0  $ name    : chr  " p1" " p2"
 > rbind(x,c('int',NA,NA,'cluster'))   datatype length scale    name 1    float      0     0      p1 2    float      0     0      p2 3      int   <NA>  <NA> cluster
 > x   datatype length scale name 1    float      0     0   p1 2    float      0     0   p2
 > x = rbind(x,c('int',NA,NA,'cluster'))
 > x   datatype length scale    name 1    float      0     0      p1 2    float      0     0      p2 3      int   <NA>  <NA> cluster

Andrew_Gibson · January 2015

Appreciate you getting back to me so quickly Chris. Looking at my comment on this screen it looks as though it was truncated somehow. Perhaps there is a character limit? Anyhow, yes, I figured out the workaround. My problem was that nowhere in the documentation or through web-search could I find a good spec as to what x (the input value) or the return value from this function needed to look like. Finding out that the input data-frame looks exactly like the outpurt was a big deal for me and something I thought worth sharing. Also, noticing that the datatype and name are factors makes it easy to remove records but potentially more challenging to add them., particulalry because when you run it from Vertica I don't see a good way to spot that there is a problem. FWIW - I'm impressed already with what I can do with Vertica/R but I can't help feeling that both the documentation and the R examples could be more extensive. Templates would help, the example code seems to have a couple of typos and a guide to debugging would be VERY nice indeed.

Chris_C · January 2015

Hi Andrew,

Have you looked at the most recent documentation? The section on R was recently updated.
I think that this page:
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/ExtendingHPVertica/UDx/UDxR/ROuttypeCallba...
provides the details that you are looking for.

I agree that more examples and debugging tips would be useful. I can put in a request to add debugging tips. As far as examples go, can you provide any suggestions or use cases that we might use as an example?

Also, you may be interested in Distributed R: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/ It is a new open source offering from HP Vertica (paid support is also available). Distributed R provides a native R environment that easily interfaces with Vertica and is scalable to accomodate huge data sets.

Thanks,
Chris

Andrew_Gibson · January 2015

Thanks Chris - yes I believe this is the version of the documentation I'm working with. A couple of thoughts: it does not tell us what x is and the example code (for the mul function) doesn't use it so I can't really reverse-engineer for it from this documentation. An example of both x content and structure and a non-trivial example that converts input to output would help a lot. (Knowing that the datatype and name fields come through as factors is a big deal.) Please correct me if I have misunderstood this, but the other thing that took me a while to realize was that if I wanted identifying information for each record in the SQL output, I would have to pass it through the UDTF. Without this, the results don't seem to be very useful: I have a list of cluster numbers but no way of knowing what they refer to :-) Also, if I'm building a general purpose UDTF, one that can be called against a variety fo data, I can't know exactly how many fields it will need to fully define a record. I think the polymorphic kmeans example in the documentation assumes it's only 1 field but that's not very realistic. I've got to be able to define this as a parameter in the SQL call and return the fields I need whether it's 1, 2 or 5 of them. I have in mind a range of UDTF functions to expose analytic functionalty (like kmeans) to SQL so that we more can rapidly build apps, but they will all need this polymorphic capability to be really useful. I am interested in Distributed R but right now I think I can get what I need from User Defined Functions, RODBC and perhaps some multi-core extensions to R. As the data volume increases, it's good to know there is somewhere to go :-) I'll be happy to provide you with some use-cases. Let's connect off-line. Andrew

We're Moving!

Create My New Community Account Now

Undefined Number of Columns in R Polymorphic Function

Comments

Leave a Comment