Loading Large DataFrames and Arrays
Cannot Load Large Data Sets:
I am trying to build a darray for a matrix that I have already built in R, but I get this error: Error in value[[3L]](cond) : cannot allocate vector of size 885.8 Mb.
This was on a relatively small matrix, only 71500 x 1001, and I need to handle 15 times that much data.
My question is: is there a way to import a 2.5 GB data table into a distributed cluster of 15 nodes without using Vertica ODBC?
Comments
I would like to upload a text file in order to test out the performance of Distributed R Random Forest on my own cluster of nodes. Thanks
Thanks for reporting the issue; we would like to understand the error better. Can you please provide more information about the cluster nodes and how you are loading the data? Specifically: how much memory is available on each node, how much shared memory did you configure, and how many executors run on each node? When you build the darray from the matrix, how many partitions did you use, and how did you build it from the matrix? Can you please provide the code as well?
Thanks,
Kathy
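One note that may help while that information is being gathered: this error often appears when the full matrix is converted to a darray on the master node in one step. Below is a minimal sketch of an alternative that creates an empty darray and has each worker fill its own partition by reading only its slice of a text file, so the whole 2.5 GB never has to fit in one process. The file path, block size, and use of read.table are illustrative assumptions, not from the original post; the file must be a plain numeric table readable from every node.

    library(distributedR)
    distributedR_start()

    nrows <- 71500
    ncols <- 1001
    bsize <- 5000                                    # rows per partition (illustrative)

    # Empty darray, partitioned into blocks of bsize rows
    da <- darray(dim = c(nrows, ncols), blocks = c(bsize, ncols))

    # Each executor reads only its own slice of the file and writes it
    # into its partition, so the master never holds the full matrix.
    foreach(i, 1:npartitions(da),
      function(d = splits(da, i), idx = i, bs = bsize) {
        chunk <- as.matrix(read.table("/shared/matrix.txt",  # hypothetical path, visible to all nodes
                                      skip = (idx - 1) * bs,
                                      nrows = nrow(d)))
        d <- chunk
        update(d)
      })

    getpartition(da, 1)[1:5, 1:5]                    # quick sanity check on the first partition

With blocks of 5000 rows, a 71500 x 1001 matrix splits into 15 partitions, the last one holding the remaining 1500 rows.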
The loading time actually depends on a couple of factors such as:
- Number of distributedR and Vertica nodes used
- Number of instances used in the distributedR cluster (set with the "inst=" parameter of distributedR_start() or with the <inst> tag in your cluster configuration file; if this parameter is not specified or is set to 0, one instance is started per core on the machine)
If the total number of instances started in distributedR (sum(distributedR_status()$Inst) in R) becomes too high, the loading process can slow down, because under the hood that many simultaneous queries are sent to Vertica.
That said, a data set of the size you mention should ideally take about 1-2 minutes to load, though this can vary depending on the factors above.
On a separate note, for loading to succeed, please make sure that the Vertica configuration parameter 'MaxClientSessions' is set to a value greater than the total number of instances started in distributedR. This can be done with "select set_config_parameter('MaxClientSessions', n);", where n is greater than the total number of distributedR instances.
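For example, the total can be checked from the R master before adjusting MaxClientSessions (a quick sketch, assuming the cluster is already running):

    library(distributedR)
    distributedR_start()                             # or distributedR_start(inst = 4) to cap instances per node
    total_inst <- sum(distributedR_status()$Inst)    # total executors across all nodes
    total_inst
    # MaxClientSessions should then be set above total_inst, e.g. in vsql:
    # select set_config_parameter('MaxClientSessions', n);   -- with n > total_inst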
Thanks,
Shreya