Options

Generating Row and Column Names when using R UDF

I am reposting the query I that started on vertica-forums.com (http://vertica-forums.com/viewtopic.php?f=62&t=1158) Hi, I am trying to use the R UDF to create a cross tab and have given the R function and the Vertica commands below. generateCrosstab <- function(x) { var1 = x[,1] var2 = x[,2] crossTab <- as.data.frame.matrix(table(var1,var2)) crossTab } crossTab_factory <- function() { inlist <- rep("int", 2) outlist <- c("int", "int") list(name=generateCrosstab, udxtype=c("transform"), intype=inlist, outtype=outlist) } DROP LIBRARY crossTab CASCADE; CREATE LIBRARY crossTab AS '/home/dbadmin/test/crossTab.r' LANGUAGE 'R'; CREATE TRANSFORM FUNCTION generateCrosstab AS NAME 'crossTab_factory' LIBRARY crossTab ; SELECT generateCrosstab(column1, column2) OVER() FROM testdb.table; This query generates the following output: col0 | col1 --------+-------- 466881 | 655735 21047 | 28336 (2 rows) Is there any way to generate the row names and the column names? I have given an example of what I am looking for: setosa versicolor virginica setosa 50 0 0 versicolor 0 50 0 virginica 0 0 50 Would appreciate any help on this? Thank you. Ravi

Comments

  • Options
    It looks like the R codes are now showing up in the original post. I have pasted them below: generateCrosstab <- function(x) { var1 = x[,1] var2 = x[,2] crossTab <- as.data.frame.matrix(table(var1,var2)) crossTab } crossTab_factory <- function() { inlist <- rep("int", 2) outlist <- c("int", "int") list(name=generateCrosstab, udxtype=c("transform"), intype=inlist, outtype=outlist) } DROP LIBRARY crossTab CASCADE; CREATE LIBRARY crossTab AS '/home/dbadmin/test/crossTab.r' LANGUAGE 'R'; CREATE TRANSFORM FUNCTION generateCrosstab AS NAME 'crossTab_factory' LIBRARY crossTab ; SELECT generateCrosstab(column1, column2) OVER() FROM testdb.table;
  • Options
    It looks like some the code I posted got clipped. I have pasted it below: generateCrosstab <- function(x) { var1 = x[,1] var2 = x[,2] crossTab <- as.data.frame.matrix(table(var1,var2)) crossTab } crossTab_factory <- function() { inlist <- rep("int", 2) outlist <- c("int", "int") list(name=generateCrosstab, udxtype=c("transform"), intype=inlist, outtype=outlist) } DROP LIBRARY crossTab CASCADE; CREATE LIBRARY crossTab AS '/home/dbadmin/test/crossTab.r' LANGUAGE 'R'; CREATE TRANSFORM FUNCTION generateCrosstab AS NAME 'crossTab_factory' LIBRARY crossTab ; SELECT generateCrosstab(column1, column2) OVER() FROM testdb.table;
  • Options
    The code got clipped in both the posts. The following link contains the entire code: http://vertica-forums.com/viewtopic.php?f=62&t=1158
  • Options
    Hi Ravishankar, Hm... Well, in SQL, rows don't exactly have names... If you want a record to have a name, you can modify the code that produces that record (ie., your R code) to emit one more column, a string column. Then use that column as a "name" column. The name will show up as a value next to the other contents of the row, which is I think what you want? SQL columns do have names, though. You can specify the names of the columns that your function emits in that function's factory. See the K-means example in /opt/vertica/sdk/examples/RFunctions/RFunctions.R (note the "outnames" list value). Adam
  • Options
    Hi Adam, Thank you I have tried generating the row and column names in R and attach them to the data frame. I am able to attach the row name as a new column as follows:
      generateCrosstab <- function(x)  {   var1 = x[,1]   var2 = x[,2]   crossTab <- as.data.frame.matrix(table(var1,var2))   crossTab$rowname <-  rownames(crossTab)   save(file=modelfile, crossTab)   crossTab  }    crossTab_factory <- function()  {   inlist <- rep("int", 2)   outlist <- c("int","int","varchar")   list(name=generateCrosstab, udxtype=c("transform"), intype=inlist, outtype=outlist)  }  
    However, when I try to add the column name (as a row), I am getting an error: I have given the code used below:
      generateCrosstab <- function(x)  {   var1 = x[,1]   var2 = x[,2]   crossTab <- as.data.frame.matrix(table(var1,var2))   crossTab$rowname <-  rownames(crossTab)   crossTab <- rbind(crossTab,colnames(crossTab))       save(file=modelfile, crossTab)   crossTab  }    crossTab_factory <- function()  {   inlist <- rep("int", 2)   outlist <- c("varchar","varchar","varchar")   list(name=generateCrosstab, udxtype=c("transform"), intype=inlist, outtype=outlist)  }    
    I get the following error ERROR 3399: Failure in UDx RPC call InvokeProcessPartition(): Error calling processPartition() in User Defined Object [generateCrosstab] at [RInterface.cpp:1006], error code: 0, message: Exception in processPartitionForR: not compatible with REALSXP I have changed all the column types to varchar since the column names are character (though the other values in the cross tab are numeric this should return them as characters as well). This works fine in R though. Could you please provide further guidance? Thank you. Ravi
  • Options
    The outnames option in the factory function gives the column names.
      crossTab_factory <- function()  {   inlist <- rep("int", 2)   outlist <- c("int","int","varchar")   list(name=generateCrosstab, udxtype=c("transform"), intype=inlist, outtype=outlist,outnames = c("0","1","VarName"))  }  
    However, this option seems very rigid since we have to define the column variable levels upfront. Is there any way to automate this? Thank you. Ravi
  • Options
    Hi Ravishankar, For the outnames, well, crossTab_factory is a function; it can generate the column names however it wants, they don't need to just be constants. As for generating them later (ie., after the data has been processed), SQL won't allow that: The column names are a part of the type signature of your function; it's the names of the variables that the function returns. For example, I believe you could do: SELECT VarName, "0" + "1" FROM (SELECT generateCrosstab(column1, column2) OVER() FROM testdb.table) AS my_subquery; So what are VarName, 0, and 1? Vertica has to figure that out before it can start parsing and running your query. Even without a subquery, Vertica ships the column types and names over to vsql first thing so that vsql can start preparing to display the results while the server is still computing them. For your other code: Could you try explicitly converting the remaining values to strings? If you tell Vertica that you are going to give us strings, Vertica expects that you will follow through and actually give us strings. (That's how strongly-typed languages like SQL work; not having to check the data type constantly on every value is actually a big deal for performance. Unfortunately, R doesn't natively support strong typing; it looks from your post like we have a less-than-helpful error message if you mix up types mid-resultset.) Adam

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file