Problems with R polymorphic UDTF
I'm attempting to write a polymorphic kmeans function that returns all the input fields AND the integer cluster identifier. I have user-defined scalar functions working just fine, as well as the example transform function, and I have managed a polymorphic input transform. The problem is when I want the output to be polymorphic.

I have defined outtypecallback and a related function. The outtypecallback function is being called and (by saving objects to file from within the code) I figured out the inbound spec for x in my outtypecallback function and hence, I assume, the spec for the return object. Here's what the return object looks like given 2 float inputs and the cluster identifier (integer):

    > x
      datatype length scale    name
    1    float      0     0      p1
    2    float      0     0      p2
    3      int      0     0 Cluster
    > str(x)
    'data.frame': 3 obs. of 4 variables:
     $ datatype: Factor w/ 2 levels "float","int": 1 1 2
     $ length  : chr "0" "0" "0"
     $ scale   : chr "0" "0" "0"
     $ name    : Factor w/ 3 levels "Cluster","p1",..: 2 3 1

As far as I can tell (by adding logging into the code) this is the last (of my functions) that it calls before failing with this error below. I can see repeated calls to the factory function, a couple to the parametercallback function and this last one to the outtypecallback function. It's not getting to the main clustering routine at all. Any ideas?
Comments
Can you please share your Factory and Outtypecallback functions?
Thanks,
Arash
KMeansClusterFactory <- function() {
  x <- list(
    name = KMeansCluster,
    udxtype = 'transform',
    intype = 'any',
    outtype = 'any',
    outtypecallback = KMeansClusterReturnTypes,
    parametertypecallback = KMeansClusterParameters
  )
  x
}
KMeansClusterReturnTypes <- function(x, y) {
  # This is the callback function used by Vertica to decide the return types.
  # Despite what the manual says, x here is NOT the same thing
  # that is passed to the main function.
  #
  # x is a 4-column data frame that specifies data types for each field,
  # something like this:
  #   datatype length scale name
  # 1 float    0      0     p1
  # 2 float    0      0     p2
  # x[,1] (datatype) and x[,4] (name) are coming through as factors though,
  # which is a problem when I try to add a new record that does not share
  # the same factor levels: it adds a record but uses NA values instead of
  # the levels I want.
  # Let's try converting factors to characters first
  x[,1] <- as.character(x[,1])
  x[,4] <- as.character(x[,4])
  # return data information for all input fields
  # PLUS the Cluster field we just added
  x <- rbind(x, c('int',0,0,'Cluster'))
  # and back to factors (just in case)
  x[,1] <- as.factor(x[,1])
  x[,4] <- as.factor(x[,4])
  x
}
Vertica UDx debugging can be turned on by using the following statement:
Select set_debug_log('EE','UDx_Fence');
To log from the R UDx use:
vertica_log(str)
and the str will be written to UDxLogs/UDxFencedProcesses.log
Many months later I'm coming back to this and I cannot find the "vertica_log(str)" function. Is there a Vertica package I should be loading to get it?
Hi Andrew!
I know I am really late to the discussion, but I think I can shed some light here. Let me elaborate:
1) You're right, the object that is received by the outtypecallback is not the same as the one received by the main function. It is a data.frame object that describes each column in the input (which I guess is Vertica's solution to transferring objects between R and Vertica).
2) Upon executing a UDF, Vertica will call the Factory, Parameters and Output functions before executing the MAIN function (the code you're interested in). This is because Vertica needs to know how the inputs/outputs will be "translated" before actually executing the code. This led to some setbacks in a previous development I was involved in.
3) Your error seems to come from using factor as the output column format. I know R receives the table in that format, but that seems to be a Vertica/R issue not yet solved by the Vertica team. Try using character type instead. In my functions I usually build the object from the ground up by initializing it with NAs, and the column types I've been using are "chr", "logical", "logical", "chr". I've never used columns 2 and 3, hence the logical type.
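To make point 3 concrete, here is a minimal sketch of what "building the object from the ground up" might look like for the callback in this thread. It has not been run against Vertica, so treat it as an assumption-laden illustration: the column names (datatype, length, scale, name) are taken from the str() output in the question, and leaving length and scale as NA simply follows the description above rather than any documented requirement.

```r
# Hypothetical rewrite of the asker's outtypecallback; untested against Vertica.
# x is the column-spec data frame Vertica passes in (one row per input column).
KMeansClusterReturnTypes <- function(x, y) {
  n_out <- nrow(x) + 1L  # all input columns plus the new Cluster column

  # Initialise with NAs so the columns come out as chr, logical, logical, chr
  ret <- data.frame(
    datatype = rep(NA_character_, n_out),
    length   = rep(NA, n_out),
    scale    = rep(NA, n_out),
    name     = rep(NA_character_, n_out),
    stringsAsFactors = FALSE
  )

  # Copy the inbound types and names as plain character (never factor),
  # then append the integer cluster identifier
  ret$datatype <- c(as.character(x[[1]]), "int")
  ret$name     <- c(as.character(x[[4]]), "Cluster")
  ret
}

# Mock of the inbound spec, with factors as described in the question
mock_spec <- data.frame(
  datatype = factor(c("float", "float")),
  length   = c(0, 0),
  scale    = c(0, 0),
  name     = factor(c("p1", "p2"))
)
str(KMeansClusterReturnTypes(mock_spec, NULL))
```

Unlike the rbind() approach, this never converts back to factors, so there are no factor levels for a new row to fall outside of.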