Loading Large DataFrames and Arrays
Cannot Load Large Data Sets:
I am trying to build a darray for a matrix that I have already built in R, but I get this error: Error in value[[3L]](cond) : cannot allocate vector of size 885.8 Mb.
This was on a relatively small matrix, only 71500 x 1001, and I need to handle 15 times that much data.
My question is: is there a way to import a 2.5 GB data table into a distributed cluster of 15 nodes without using Vertica ODBC?
Comments
I would like to upload a text file in order to test out the performance of Distributed R Random Forest on my own cluster of nodes. Thanks
Thanks for reporting the issue; we would like to understand the error better. Can you please provide more information about the cluster nodes and how you are loading the data? Specifically: how much memory is available on each node, how much shared memory did you configure, and how many executors run on each node? When you build the darray from the matrix, how many partitions did you use, and how did you build it from the matrix? Can you please provide the code as well?
Thanks,
Kathy
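One note that may help while that information is being gathered: this error often appears when the full matrix is converted to a darray on the master node in one step. Below is a minimal sketch of an alternative that creates an empty darray and has each worker fill its own partition by reading only its slice of a text file, so the whole 2.5 GB never has to fit in one process. The file path, block size, and use of read.table are illustrative assumptions, not from the original post; the file must be a plain numeric table readable from every node.

    library(distributedR)
    distributedR_start()

    nrows <- 71500
    ncols <- 1001
    bsize <- 5000                                    # rows per partition (illustrative)

    # Empty darray, partitioned into blocks of bsize rows
    da <- darray(dim = c(nrows, ncols), blocks = c(bsize, ncols))

    # Each executor reads only its own slice of the file and writes it
    # into its partition, so the master never holds the full matrix.
    foreach(i, 1:npartitions(da),
      function(d = splits(da, i), idx = i, bs = bsize) {
        chunk <- as.matrix(read.table("/shared/matrix.txt",  # hypothetical path, visible to all nodes
                                      skip = (idx - 1) * bs,
                                      nrows = nrow(d)))
        d <- chunk
        update(d)
      })

    getpartition(da, 1)[1:5, 1:5]                    # quick sanity check on the first partition

With blocks of 5000 rows, a 71500 x 1001 matrix splits into 15 partitions, the last one holding the remaining 1500 rows.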
The loading time actually depends on a couple of factors such as:
- Number of distributedR and Vertica nodes used
- Number of instances used in the distributedR cluster (set with the "inst=" parameter of distributedR_start() or with the <inst> tag in your cluster configuration file; if this parameter is not specified or is set to 0, one instance is started per core on the machine)
If the total number of instances started in distributedR (sum(distributedR_status()$Inst) in R) becomes too high, the loading process can slow down, because under the hood that many simultaneous queries are sent to Vertica.
That said, a data set of the size you mention should ideally take about 1-2 minutes to load, though this can vary depending on the factors above.
On a separate note, for loading to succeed, please make sure that the Vertica configuration parameter 'MaxClientSessions' is set to a value greater than the total number of instances started in distributedR. This can be done with "select set_config_parameter('MaxClientSessions', n);", where n is greater than the total number of distributedR instances.
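For example, the total can be checked from the R master before adjusting MaxClientSessions (a quick sketch, assuming the cluster is already running):

    library(distributedR)
    distributedR_start()                             # or distributedR_start(inst = 4) to cap instances per node
    total_inst <- sum(distributedR_status()$Inst)    # total executors across all nodes
    total_inst
    # MaxClientSessions should then be set above total_inst, e.g. in vsql:
    # select set_config_parameter('MaxClientSessions', n);   -- with n > total_inst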
Thanks,
Shreya