PARTITION BY with Transform UDFs in R
Hello,
I have an R UDTF in Vertica which takes a simple input and returns a matrix. I am trying to run this function with different sets of parameters (preferably in parallel) on the Vertica cluster.
I am trying something like,
SELECT myFunc(col1 USING PARAMETERS x=1,y=2) OVER(PARTITION BY col1) FROM mainTable;
[Vertica][VJDBC](3399) ERROR: Failure in UDx RPC call InvokeProcessPartition(): Error calling processPartition() in User Defined Object [myFunc] at [/scratch_a/release/vbuild/vertica/UDxFence/RInterface.cpp:1236], error code: 0, message: Exception in processPartitionForR: [package ‘rJava’ could not be loaded] [SQL State=VP001, DB Errorcode=3399]
I am using rJava because, within the UDTF, I need several other pieces of information from queries to Vertica.
Some help?
Comments
Is the rJava package installed on Vertica's version of R? The Vertica R binary is located in /opt/vertica/R/bin/R.
First make sure you have JDK 1.4+ installed on your system and the JAVA_HOME environment variable is set correctly.
Then run the javareconf utility so R knows where Java is (you may have to be root or use sudo to do this). Next, run the Vertica-packaged version of R and install the rJava package. Finally, try to run your R UDF again.
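The steps above could look roughly like the following sketch. It assumes the Vertica R binary lives at /opt/vertica/R/bin/R, as mentioned earlier in this reply. The CRAN mirror URL and the helper names check_rjava_prereqs and print_rjava_setup are illustrative assumptions, not part of Vertica. The script only prints the two setup commands so they can be reviewed and then run as root; on a multi-node cluster they presumably need to be repeated on every node.

```shell
#!/bin/sh
# Sketch only: check the prerequisites from the steps above, then print
# the setup commands instead of running them.

# Path taken from this thread; override VERTICA_R if your install differs.
VERTICA_R="${VERTICA_R:-/opt/vertica/R/bin/R}"
# Assumption: any CRAN mirror works; this one is just an example.
CRAN_MIRROR="${CRAN_MIRROR:-https://cloud.r-project.org}"

# Hypothetical helper: verify JAVA_HOME and the Vertica R binary.
check_rjava_prereqs() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$VERTICA_R" ]; then
        echo "Vertica R binary not found at $VERTICA_R"
        return 1
    fi
    echo "prerequisites look fine"
}

# Hypothetical helper: print the commands to run as root (or via sudo).
print_rjava_setup() {
    # javareconf records the Java location in R's configuration.
    echo "$VERTICA_R CMD javareconf"
    # Install rJava into Vertica's R library from the chosen mirror.
    echo "$VERTICA_R -e 'install.packages(\"rJava\", repos=\"$CRAN_MIRROR\")'"
}
```

Calling check_rjava_prereqs && print_rjava_setup prints the commands only when both prerequisites are in place.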
Please let us know if this helps!
If I try to execute without PARTITION BY, it runs successfully.
I have removed this feature of the script and replaced its inner workings. Further research will be done and posted if new information arises.
The [vertica-udx-R] <defunct> processes are a known issue, but they don't interfere with query completion. It's just that the cleanup after the query finishes is not done properly. The defunct processes are cleaned up once the parent exits.
Are your queries completing fine or do you get an error?
Thanks
Pratibha
We are working on fixing the defunct processes issue. We do print some messages to the logs that are part of normal execution. Because R is an interpreted language, we have to parse the R library and look for certain optional functions such as returntypescallback; if we don't find them, we simply note that in the logs.
Pratibha
The fix will be included in the next release; I don't have an ETA for that. What version are you running? In the meantime, you can also run an additional dummy call to the R function without the PARTITION BY clause, as suggested by another user here: https://community.vertica.com/vertica/topics/why_r_udf_work_on_single_node_but_not_work_on_cluster?u...
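As we read that suggestion, the workaround might be sketched as below. The function, column, table, and parameter names are the ones from the original post; the helper run_with_warmup is hypothetical, and the thread does not say whether the two statements need to share a session, so treat this as a starting point rather than a confirmed fix.

```shell
#!/bin/sh
# Sketch only: issue an unpartitioned "dummy" call before the real query.
# myFunc, col1, mainTable and the parameters come from the original post.

WARMUP_SQL='SELECT myFunc(col1 USING PARAMETERS x=1, y=2) OVER() FROM mainTable;'
PARTITIONED_SQL='SELECT myFunc(col1 USING PARAMETERS x=1, y=2) OVER(PARTITION BY col1) FROM mainTable;'

# Hypothetical helper: "$@" is the client command, for example vsql with
# its connection options; -c passes a single command to the client.
run_with_warmup() {
    "$@" -c "$WARMUP_SQL" && "$@" -c "$PARTITIONED_SQL"
}
```

With vsql on the PATH, run_with_warmup vsql would run the unpartitioned call first and the partitioned query only if that call succeeds.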
I can't remember what we did on this particular matter, but if you give me the last complete error line you encountered, I can try to help as well.
On the side, we had to increase the Java memory limit to give the query room to execute. This helped us temporarily.
Can you verify that your R function does not create [vertica-udx-R] <defunct> processes on the Vertica nodes?
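One way to check this on a node is to count the matching lines in the process list. The helper name count_defunct_udx below is hypothetical, and the pattern assumes the processes show up in ps output as [vertica-udx-R] <defunct>, as quoted in this thread.

```shell
#!/bin/sh
# Sketch only: count zombie R UDx processes in `ps` output read from stdin.

# Hypothetical helper; matches lines such as "[vertica-udx-R] <defunct>".
count_defunct_udx() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function's status at zero.
    grep -c 'vertica-udx-R.*<defunct>' || true
}
```

Running ps -ef | count_defunct_udx on each node after a query finishes prints the number of leftover processes there.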