# Using R's standard functions in Vertica

I've read the article "How to implement R in Vertica", attempted to write an R function, and got an error saying I didn't specify an argument (but I know I specified all the arguments the function takes). I'm really just frustrated because I don't understand why I can't just use R's built-in functions (like plot(), ts(), etc.). Why do I need to essentially re-define an R function in Vertica? And if I don't need to, then how do I implement a standard R function in Vertica without having to re-define it (with intypes and outtypes, etc.)?

## Comments

It sounds to me like R-UDx (which is the document that you read) is not what you want. It sounds like you want R-ODBC:

http://cran.r-project.org/web/packages/RODBC/index.html

(This is not a Vertica-specific feature; it's available for many different databases. If you have questions about getting it set up for Vertica, though, feel free to ask.)

R-ODBC will let you use standard R functions, etc. It's also probably much more familiar to you as an R developer.
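To make that concrete, here is a minimal RODBC sketch. The DSN name ("VerticaDSN"), credentials, and the "sales" table are all hypothetical placeholders; you'd substitute whatever you configured in your odbc.ini.

```r
# Minimal RODBC sketch: pull data out of Vertica into an ordinary
# R data frame, then use any standard R function client-side.
# (DSN name, credentials, and table are hypothetical.)
library(RODBC)

ch <- odbcConnect("VerticaDSN", uid = "dbadmin", pwd = "secret")

# A resultset comes back as a plain data frame:
sales <- sqlQuery(ch, "SELECT sale_date, amount FROM sales ORDER BY sale_date")

# Now plot(), ts(), etc. all work exactly as they do in any R session:
plot(sales$sale_date, sales$amount, type = "l")
series <- ts(sales$amount, frequency = 12)

odbcClose(ch)
```

The key point: the data leaves Vertica and the computation happens in your local R session, so there's nothing to re-define.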

R-UDx (which is what you're trying to use) is for when you need a lot more raw computational power than R-ODBC gives you, and you're willing to make some tradeoffs to get it.

In particular, it is able to execute code in parallel, on multiple cores within a computer, on multiple computers across a cluster all at once, and on data sets that are much too large to fit into system memory. You can't take just any code and magically make it run in parallel on many computers (for example, with plot(), which server in your Vertica cluster would display the plot?); it takes thought, sometimes a lot of thought depending on the algorithm.

The data-types business is interesting: SQL is a strongly-typed language; client programs can (and do) ask up front "what is the data type of this output column?" before actually getting any results back. Lots of visualization programs make heavy use of this ability in order to decide how to render columns. And R-UDx functions, unlike R-ODBC code, can be called from and returned as SQL values. So we need to know what type of data you're going to return before you start returning data. If you don't want to deal with that, just use R-ODBC :-)
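As a hedged sketch of what that up-front declaration looks like on the R-UDx side (the field names follow the pattern in Vertica's R UDx documentation; the function names here are made up):

```r
# Sketch of an R-UDx factory. The factory is what tells Vertica,
# before any data flows, the argument and return types.
# (Function names are hypothetical.)

mult_by_two <- function(x) {
  # A scalar UDx receives its arguments as a data frame,
  # one column per argument, and returns one value per input row.
  x[, 1] * 2
}

mult_by_two_factory <- function() {
  list(name    = mult_by_two,   # the R function Vertica will run
       udxtype = c("scalar"),   # value-in, value-out
       intype  = c("float"),    # declared argument type
       outtype = c("float"))    # declared return type
}
```

Once a library containing this is loaded (via CREATE LIBRARY / CREATE FUNCTION in SQL), clients can ask "what type does mult_by_two return?" before running anything, because the factory declared it.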

Incidentally, "ts()": Vertica does support time-series functions in native SQL:

https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/ProgrammersGuide/Analytics/UsingTimeSerie...

Feel free to use either the Vertica or R version. If you have questions about the Vertica version, feel free to ask.

Adam

Last thing for now: Can you clarify the difference between a scalar function and a transform function? Although I have a background in mathematics, I'm not sure my definitions match Vertica's, so basically I'd like an explanation of udxtype. One example I was unsure of is R's ts() function (which I see I don't need anyway, since Vertica has its own way of dealing with time-series data), but just hypothetically: would converting some number of columns from, say, type double to type ts be scalar or transform? I thought it would be a transform but wasn't sure, since my definition of "scalar" involves constants, vectors, etc., not functions; perhaps I just haven't come across "scalar" functions in my studies. Also, could you compare the terms scalar and transform to common mathematical functions/transformations (e.g., linear function, exponential function, linear transformation)?

Thanks a bunch!

To have functions that accept any argument type, you do need to declare multiple factories. But I believe you can reuse the actual function, at least on the R side, so you shouldn't need to duplicate much (if any) R code. (Disclaimer: I haven't tried this in R; I'm more of a C++ guy myself.)

Functions are first-class objects in R, and the factory is just a function. If you need to declare a lot of factory functions following a particular pattern, you could probably write a few lines of code to do so automatically. But that's getting into general R programming. We'd be glad to try to help, but the R users' mailing list would probably be a better bet.

Regarding "scalar" vs "transform": A scalar function operates on values. A transform function operates on resultsets.

A scalar function must be equivalent to a function that takes in one value at a time (or one tuple of values, in the case of multiple arguments) and returns one value. The data types needn't be the same, but it's still value -> value.

The current implementation passes in many values at a time and expects many values back, but that's purely for performance ("vectorization"); it doesn't mean you're allowed to do anything that you couldn't do in the simple case where we pass in a single value at a time.

Scalar functions must be stateless -- any given returned value must depend only on the corresponding argument value, not on previous values that the function may have seen during the query. (This constraint makes them much simpler to run in parallel.) Many useful R functions, including ts(), fail this test and so are not scalar functions.
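A quick way to see the statelessness test in plain R: a function qualifies only if batching the input makes no difference to the output.

```r
# Stateless vs. stateful, illustrated with plain R functions.
square  <- function(x) x^2      # stateless: each result depends on one value
running <- function(x) cumsum(x)  # stateful: each result depends on prior rows

x <- c(3, 1, 4)

# For a stateless function, one-value-at-a-time gives the same answer
# as the whole vector at once:
sapply(x, square)   # 9 1 16
square(x)           # 9 1 16

# For a stateful function, feeding one value at a time loses the
# running state entirely:
sapply(x, running)  # 3 1 4  -- just echoes the input
running(x)          # 3 4 8  -- the actual running total
```

square() would make a valid scalar UDx; cumsum() (and similarly ts()) would not.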

"Scalar functions", by our definition, intuitively match what you'd expect out of basic arithmetic functions. (At least, we hope they do.) Very easy to use; not terribly powerful.

Transform functions take a collection of inputs and produce a collection of outputs. They can produce multiple columns of output. The number of input columns needn't match the number of output columns. The number of input rows also needn't match the number of output rows. Hence, transforming one resultset to an arbitrarily-different resultset. Much more general; much more complicated.
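In plain R terms, a transform is roughly "data frame in, differently-shaped data frame out". A tiny illustrative sketch (names hypothetical):

```r
# Transform-style function: a resultset in, an arbitrarily different
# resultset out. Here: one numeric column and four rows in;
# two columns and one row out.
summarize_partition <- function(df) {
  data.frame(n    = nrow(df),
             mean = mean(df[, 1]))
}

input  <- data.frame(val = c(1, 2, 3, 4))
output <- summarize_partition(input)  # 1 row, 2 columns
```

Note that neither the row count nor the column count survives -- which is exactly what a scalar function is forbidden from doing.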

In order to parallelize any operation, and therefore to make use of multiple cores and multiple nodes, it's necessary to break it up into independent units of computation. (There're a bunch of fancier formalisms behind that, but that's the casual statement.) For scalar functions this is easy -- each row is independent. So we can deal with those automatically for you.

For transform functions, most cases can be parallelized, but you have to know a little about the data. For example, the classic PageRank algorithm runs serially on a given page, but separate pages are independent. So you can't PageRank a single page any faster; but if your data were to be partitioned by page, you could easily PageRank a whole website's worth of pages simultaneously. Therefore, whenever you call a transform function, you must (in SQL) give it an "OVER" clause that describes how to break up the particular data set that you're querying such that we can parallelize the particular computation that you're requesting. (You can also request that we provide the data in sorted order, which is helpful for some algorithms; etc.) You can leave the clause empty; but then we have to assume the worst case (for example, some cryptographic hash function that must receive all values serially) and you get no more parallelism.
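The OVER (PARTITION BY ...) idea can be simulated in plain R: split the data into independent partitions, apply the transform to each (these applications are what Vertica could run in parallel across cores and nodes), then stitch the results together. The "pages" data here is made up for illustration.

```r
# Simulating OVER (PARTITION BY page) with split()/lapply().
pages <- data.frame(page  = c("a", "a", "b", "b", "b"),
                    score = c(1, 2, 3, 4, 5))

# The transform runs serially *within* a partition...
per_partition <- function(df) {
  data.frame(page = df$page[1], total = sum(df$score))
}

# ...but the partitions themselves are independent, so each lapply()
# call below could happen on a different core or node:
parts   <- split(pages, pages$page)            # PARTITION BY page
results <- do.call(rbind, lapply(parts, per_partition))
# results: one row per page (totals 3 and 12)
```

In SQL the equivalent call would look roughly like SELECT per_partition(score) OVER (PARTITION BY page) FROM pages; the empty-OVER case corresponds to split() with a single group -- one big serial partition.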

Anyway, there's a whole bunch of information for you. Hope it helps,

Adam