distributedR Multiple-Machine Mode Installation
I am having difficulty installing distributedR in multiple-machine mode. The single-node installation runs successfully on each machine, and I have enabled password-less, prompt-less logins between all nodes, with RSA keys for each node installed on every other node. When I specify the cluster.xml file, every worker address other than 127.0.0.1 fails.
Comments
Sorry that you are facing issues with the multi-node installation. Can you tell us what error shows up on the R console (does it time out after 60 seconds, or are there other messages)? Can you also send us the following:
1) the log files /tmp/R_master_*, /tmp/R_worker_*, /tmp/R_executor_*
2) your cluster.xml file
Thanks,
Indrajit
distributedR_start(cluster_conf='~/cluster5.xml')
Workers registered - 1/2. Wait upto 60 seconds.
Using only 1 workers. Check with distributedR_status().
Not registered workers
server20
Check log files (/tmp/R_worker_...) in each node.
Master address:port - server18:5555
MASTER LOG:
cat /tmp/R_master_Jesse_server18.5555.log
2014-Mar-18 17:49:24.756095 [INFOR] Master node is listening at 5555 port.
2014-Mar-18 17:49:24.756636 [INFOR] Resource Manager Created
2014-Mar-18 17:49:24.756670 [INFOR] Master Initialization done
2014-Mar-18 17:49:25.784351 [INFOR] Worker 127.0.0.1:53428 registered. Number of executors: 1; Shared Memory Segment: 1794355200
2014-Mar-18 17:49:26.050786 [INFOR] Master awaiting HELLO handshaking with Workers.
2014-Mar-18 17:50:26.056559 [WARN] Only 1 workers are registered. Check with distributedR_status()
2014-Mar-18 17:50:26.056684 [INFOR] Checking non-registered workers - 127.0.0.1 (null)
2014-Mar-18 17:50:26.056722 [INFOR] Comparing with the registered worker - 127.0.0.1
2014-Mar-18 17:50:26.056756 [INFOR] Checking non-registered workers - server20 (null)
2014-Mar-18 17:50:26.056786 [INFOR] Comparing with the registered worker - 127.0.0.1
2014-Mar-18 17:50:26.056821 [INFOR] Master started.
2014-Mar-18 17:51:14.060796 [INFOR] distributedR shutdown complete.
MASTER EXECUTOR:
cat /tmp/R_executor_Jesse_server18.5555_0.log
2014-Mar-18 17:49:25.771908 [INFOR] Executor started.
2014-Mar-18 17:49:25.772235 [INFOR] Communication pipe with Worker opened on 17
Loading required package: lattice
Loading required package: Rcpp
Attaching package: ‘MatrixHelper’
The following object is masked from ‘package:Matrix’:
unpack
Loading required package: RInside
Attaching package: ‘Executor’
The following object is masked from ‘package:Matrix’:
update
The following object is masked from ‘package:stats’:
update
2014-Mar-18 17:49:28.611206 [INFOR] *** No Task under execution. Waiting from Task from Worker **
WORKER LOG:
cat /tmp/R_worker_Jesse_server18.5555.log
2014-Mar-18 17:48:36.726025 [INFOR] Starting worker.
2014-Mar-18 17:48:36.734461 [INFOR] Creating Executors in Worker
2014-Mar-18 17:48:36.734875 [INFOR] Created new Executor 0 with Process ID 3609
2014-Mar-18 17:48:36.735193 [INFOR] Created HandleRequest threads to listen requests from Master
2014-Mar-18 17:48:36.735237 [INFOR] Worker server20:2020 with 1 executors and 1794487910 Shared Memory
2014-Mar-18 17:48:36.742985 [INFOR] Creating a connection for handshake with master server18:5555
2014-Mar-18 17:48:36.743146 [INFOR] Worker opened connection to Master at server18:5555
2014-Mar-18 17:48:36.743208 [INFOR] Sending reply with worker info: server20 2020
2014-Mar-18 17:48:36.743372 [INFOR] HELLO Handshaking reply sent to Master. Master server18:5555 registered with Worker
2014-Mar-18 17:51:06.743996 [INFOR] Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150
2014-Mar-18 17:51:06.744145 [INFOR] Worker Shutdown triggered.
2014-Mar-18 17:51:06.746717 [INFOR] Worker shutdown - destroying executorpool
2014-Mar-18 17:51:06.749097 [INFOR] Worker shutdown - Removing shared memory segments
2014-Mar-18 17:51:06.749191 [INFOR] Worker shutdown - Removing sem lock : -1
2014-Mar-18 17:51:06.778125 [INFOR] Worker shutdown - Closing connection to other workers
2014-Mar-18 17:51:06.778370 [INFOR] Worker Shutdown complete.
WORKER EXECUTOR:
cat /tmp/R_executor_Jesse_server18.5555_0.log
2014-Mar-18 17:48:36.747222 [INFOR] Executor started.
2014-Mar-18 17:48:36.747552 [INFOR] Communication pipe with Worker opened on 17
Loading required package: lattice
Loading required package: Rcpp
Attaching package: ‘MatrixHelper’
The following object is masked from ‘package:Matrix’:
unpack
Loading required package: RInside
Attaching package: ‘Executor’
The following object is masked from ‘package:Matrix’:
update
The following object is masked from ‘package:stats’:
update
2014-Mar-18 17:48:39.592907 [INFOR] *** No Task under execution. Waiting from Task from Worker **
CLUSTER.XML:
<MasterConfig> <ServerInfo>
<Hostname>server18</Hostname>
<Port>5555</Port>
</ServerInfo>
<Workers>
<Worker>
<Hostname>127.0.0.1</Hostname>
<Port>0</Port>
</Worker>
<Worker>
<Hostname>server20</Hostname>
<Port>2020</Port>
</Worker>
</Workers>
</MasterConfig>
Judging from the logs, the worker on server20 tried to handshake with the master on server18 but failed. One possible cause is that the master machine is blocking incoming requests to port 5555 from external machines. Can you try a different port number for the master (including 0)?
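One way to test the blocked-port hypothesis is to try opening a TCP connection to the master's port from the worker machine. The sketch below is only an illustration: `check_port` is a hypothetical helper name, and it assumes bash (built with `/dev/tcp` redirection support) and the coreutils `timeout` command are available on server20.

```shell
# Hypothetical helper: report whether a TCP connection to host:port can be opened.
# Assumes bash with /dev/tcp redirection support and the coreutils `timeout` command.
check_port() {
  # $1 = hostname or IP address, $2 = TCP port
  if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Example, run on server20 while distributedR_start() is waiting on server18:
#   check_port server18 5555
```

The master only listens while a distributedR session is active on server18, so a "closed" result outside that window is not conclusive. The same check can be run in the other direction (from server18 against server20 and the worker port, 2020 in the posted configuration) while the worker is up.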
We also noticed an issue in the cluster.xml file: 127.0.0.1 is used as the hostname of a worker node. This will block communication between workers, because worker server20 does not know that 127.0.0.1 refers to server18 and treats 127.0.0.1 as itself. Even if the workers start, there will be problems later when you try to create distributed objects and assign values.
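A quick way to catch this kind of entry before starting the cluster is to scan the configuration file for loopback hostnames. This is a minimal sketch, not part of distributedR: `find_loopback_hosts` is a hypothetical helper name, and it assumes each `<Hostname>` element sits on a single line, as in the file posted above.

```shell
# Hypothetical helper: print (with line numbers) every <Hostname> entry in a
# cluster.xml that is a loopback address or name. Other nodes would resolve
# such an entry to themselves rather than to the intended machine.
# Assumes each <Hostname> element is written on one line.
find_loopback_hosts() {
  # $1 = path to the cluster configuration file
  grep -n -E '<Hostname>[[:space:]]*(127\.[0-9]+\.[0-9]+\.[0-9]+|localhost)[[:space:]]*</Hostname>' "$1"
}

# Example:
#   find_loopback_hosts ~/cluster5.xml
```

Any line it reports would then be replaced with a name that every node can resolve; in the configuration above that presumably means using server18 instead of 127.0.0.1 for the first worker.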
We also ran your cluster.xml on our cluster of machines and were able to start the workers successfully:
> library(distributedR)
Loading required package: Rcpp
Loading required package: RInside
> distributedR_start(cluster_conf='issue.xml')
Workers registered - 2/2.
All 2 workers are registered.
Master address:port - node1:5555
[1] TRUE
> distributedR_status()
Workers Inst SysMem MemUsed DarrayQuota DarrayUsed
1 127.0.0.1:49761 31 112863 8950 101376 0
2 node2:2020 31 112863 1527 101376 0
>
Master Log
cat /tmp/R_master_Jesse_server18.56089.log
2014-Mar-18 16:57:56.726864 [INFOR] Master node is listening at 56089 port.
2014-Mar-18 16:57:56.727454 [INFOR] Resource Manager Created
2014-Mar-18 16:57:56.727490 [INFOR] Master Initialization done
2014-Mar-18 16:57:57.772559 [INFOR] Worker 127.0.0.1:59587 registered. Number of executors: 1; Shared Memory Segment: 1794355200
2014-Mar-18 16:57:58.050669 [INFOR] Master awaiting HELLO handshaking with Workers.
2014-Mar-18 16:58:58.050779 [WARN] Only 1 workers are registered. Check with distributedR_status()
2014-Mar-18 16:58:58.050905 [INFOR] Checking non-registered workers - server18 (null)
2014-Mar-18 16:58:58.050941 [INFOR] Comparing with the registered worker - 127.0.0.1
2014-Mar-18 16:58:58.050977 [INFOR] Master started.
2014-Mar-18 16:58:58.051012 [INFOR] Checking non-registered workers - 127.0.0.1 (null)
2014-Mar-18 16:58:58.051043 [INFOR] Comparing with the registered worker - 127.0.0.1
2014-Mar-18 16:58:58.051076 [INFOR] Checking non-registered workers - server20 (null)
2014-Mar-18 16:58:58.051106 [INFOR] Comparing with the registered worker - 127.0.0.1
2014-Mar-18 16:58:58.051140 [INFOR] Master started.
2014-Mar-18 16:59:19.153019 [INFOR] distributedR shutdown complete.
Master Executor
2014-Mar-18 16:57:57.760048 [INFOR] Executor started.
2014-Mar-18 16:57:57.760395 [INFOR] Communication pipe with Worker opened on 17
Loading required package: lattice
Loading required package: Rcpp
Attaching package: ‘MatrixHelper’
The following object is masked from ‘package:Matrix’:
unpack
Loading required package: RInside
Attaching package: ‘Executor’
The following object is masked from ‘package:Matrix’:
update
The following object is masked from ‘package:stats’:
update
2014-Mar-18 16:58:00.601002 [INFOR] *** No Task under execution. Waiting from Task from Worker **
Worker Log
2014-Mar-18 16:57:09.520326 [INFOR] Starting worker.
2014-Mar-18 16:57:09.523133 [INFOR] Creating Executors in Worker
2014-Mar-18 16:57:09.523546 [INFOR] Created new Executor 0 with Process ID 3407
2014-Mar-18 16:57:09.523898 [INFOR] Created HandleRequest threads to listen requests from Master
2014-Mar-18 16:57:09.523955 [INFOR] Worker server20:0 with 1 executors and 1794487910 Shared Memory
2014-Mar-18 16:57:09.531126 [INFOR] Creating a connection for handshake with master server18:56089
2014-Mar-18 16:57:09.531283 [INFOR] Worker opened connection to Master at server18:56089
2014-Mar-18 16:57:09.531354 [INFOR] Sending reply with worker info: server20 61596
2014-Mar-18 16:57:09.531540 [INFOR] HELLO Handshaking reply sent to Master. Master server18:56089 registered with Worker
2014-Mar-18 16:59:39.532101 [INFOR] Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150
2014-Mar-18 16:59:39.532242 [INFOR] Worker Shutdown triggered.
2014-Mar-18 16:59:39.532952 [INFOR] Worker shutdown - destroying executorpool
2014-Mar-18 16:59:40.533198 [INFOR] Worker shutdown - Removing shared memory segments
2014-Mar-18 16:59:40.533297 [INFOR] Worker shutdown - Removing sem lock : -1
2014-Mar-18 16:59:40.556710 [INFOR] Worker shutdown - Closing connection to other workers
2014-Mar-18 16:59:40.556955 [INFOR] Worker Shutdown complete.
Worker Executor
2014-Mar-18 16:57:09.550176 [INFOR] Executor started.
2014-Mar-18 16:57:09.556907 [INFOR] Communication pipe with Worker opened on 17
Loading required package: lattice
Loading required package: Rcpp
Attaching package: ‘MatrixHelper’
The following object is masked from ‘package:Matrix’:
unpack
Loading required package: RInside
Attaching package: ‘Executor’
The following object is masked from ‘package:Matrix’:
update
The following object is masked from ‘package:stats’:
update
2014-Mar-18 16:57:12.622203 [INFOR] *** No Task under execution. Waiting from Task from Worker **