distributedR Multiple-Machine Mode Installation

I am having difficulties installing distributedR in multiple-machine mode. I have run the single-node installation successfully on each machine, and I have enabled password-less, prompt-less logins between all nodes, with RSA keys for each node on every other node. When I specify the cluster.xml file, any worker whose address is not 127.0.0.1 fails.

Comments

  • Can you please paste the cluster.xml file you specified?
  • Hello Jesse,

    Sorry that you are facing issues with the multi-node installation. Can you tell us what error shows up on the R console (does it time out after 60 seconds, or are there other messages)? Can you also send us:
    1) the log files /tmp/R_master_*, /tmp/R_worker_*, /tmp/R_executor_*
    2) your cluster.xml file

    Thanks,
    Indrajit
  • R CONSOLE:

    distributedR_start(cluster_conf='~/cluster5.xml')
    Workers registered - 1/2. Wait upto 60 seconds.
    Using only 1 workers. Check with distributedR_status().

    Not registered workers
    server20
    Check log files (/tmp/R_worker_...) in each node.
    Master address:port - server18:5555


    MASTER LOG:
    cat /tmp/R_master_Jesse_server18.5555.log
    2014-Mar-18 17:49:24.756095 [INFOR] Master node is listening at 5555 port.
    2014-Mar-18 17:49:24.756636 [INFOR] Resource Manager Created
    2014-Mar-18 17:49:24.756670 [INFOR] Master Initialization done
    2014-Mar-18 17:49:25.784351 [INFOR] Worker 127.0.0.1:53428 registered. Number of executors: 1; Shared Memory Segment: 1794355200
    2014-Mar-18 17:49:26.050786 [INFOR] Master awaiting HELLO handshaking with Workers.
    2014-Mar-18 17:50:26.056559 [WARN] Only 1 workers are registered. Check with distributedR_status()
    2014-Mar-18 17:50:26.056684 [INFOR] Checking non-registered workers - 127.0.0.1 (null)
    2014-Mar-18 17:50:26.056722 [INFOR] Comparing with the registered worker - 127.0.0.1

    2014-Mar-18 17:50:26.056756 [INFOR] Checking non-registered workers - server20 (null)
    2014-Mar-18 17:50:26.056786 [INFOR] Comparing with the registered worker - 127.0.0.1

    2014-Mar-18 17:50:26.056821 [INFOR] Master started.
    2014-Mar-18 17:51:14.060796 [INFOR] distributedR shutdown complete.


    MASTER EXECUTOR:
    cat /tmp/R_executor_Jesse_server18.5555_0.log
    2014-Mar-18 17:49:25.771908 [INFOR] Executor started.
    2014-Mar-18 17:49:25.772235 [INFOR] Communication pipe with Worker opened on 17
    Loading required package: lattice
    Loading required package: Rcpp

    Attaching package: ‘MatrixHelper’

    The following object is masked from ‘package:Matrix’:

        unpack

    Loading required package: RInside

    Attaching package: ‘Executor’

    The following object is masked from ‘package:Matrix’:

        update

    The following object is masked from ‘package:stats’:

        update

    2014-Mar-18 17:49:28.611206 [INFOR] *** No Task under execution. Waiting from Task from Worker **


    WORKER LOG:
    cat /tmp/R_worker_Jesse_server18.5555.log
    2014-Mar-18 17:48:36.726025 [INFOR] Starting worker.
    2014-Mar-18 17:48:36.734461 [INFOR] Creating Executors in Worker
    2014-Mar-18 17:48:36.734875 [INFOR] Created new Executor 0 with Process ID 3609
    2014-Mar-18 17:48:36.735193 [INFOR] Created HandleRequest threads to listen requests from Master
    2014-Mar-18 17:48:36.735237 [INFOR] Worker server20:2020 with 1 executors and 1794487910 Shared Memory
    2014-Mar-18 17:48:36.742985 [INFOR] Creating a connection for handshake with master server18:5555
    2014-Mar-18 17:48:36.743146 [INFOR] Worker opened connection to Master at server18:5555
    2014-Mar-18 17:48:36.743208 [INFOR] Sending reply with worker info: server20 2020
    2014-Mar-18 17:48:36.743372 [INFOR] HELLO Handshaking reply sent to Master. Master server18:5555 registered with Worker
    2014-Mar-18 17:51:06.743996 [INFOR] Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150
    2014-Mar-18 17:51:06.744145 [INFOR] Worker Shutdown triggered.
    2014-Mar-18 17:51:06.746717 [INFOR] Worker shutdown - destroying executorpool
    2014-Mar-18 17:51:06.749097 [INFOR] Worker shutdown - Removing shared memory segments
    2014-Mar-18 17:51:06.749191 [INFOR] Worker shutdown - Removing sem lock : -1
    2014-Mar-18 17:51:06.778125 [INFOR] Worker shutdown - Closing connection to other workers
    2014-Mar-18 17:51:06.778370 [INFOR] Worker Shutdown complete.


    WORKER EXECUTOR:
    cat /tmp/R_executor_Jesse_server18.5555_0.log
    2014-Mar-18 17:48:36.747222 [INFOR] Executor started.
    2014-Mar-18 17:48:36.747552 [INFOR] Communication pipe with Worker opened on 17
    Loading required package: lattice
    Loading required package: Rcpp

    Attaching package: ‘MatrixHelper’

    The following object is masked from ‘package:Matrix’:

        unpack

    Loading required package: RInside

    Attaching package: ‘Executor’

    The following object is masked from ‘package:Matrix’:

        update

    The following object is masked from ‘package:stats’:

        update

    2014-Mar-18 17:48:39.592907 [INFOR] *** No Task under execution. Waiting from Task from Worker **



    CLUSTER.XML:

    <MasterConfig>
      <ServerInfo>
        <Hostname>server18</Hostname>
        <Port>5555</Port>
      </ServerInfo>
      <Workers>
        <Worker>
          <Hostname>127.0.0.1</Hostname>
          <Port>0</Port>
        </Worker>
        <Worker>
          <Hostname>server20</Hostname>
          <Port>2020</Port>
        </Worker>
      </Workers>
    </MasterConfig>




  • Thank you Jesse for the information.

    From the logs, it seems that worker server20 tried to handshake with master server18 but failed. One possible cause is that the master machine is blocking incoming requests to port 5555 from external machines. Can you try other port numbers for the master (including 0)?

    We also noticed an issue in the cluster.xml file: 127.0.0.1 is used for a worker node. This will block communication between workers, because worker server20 doesn't know that 127.0.0.1 is server18; it treats 127.0.0.1 as itself. Even though you can start the workers, there will be issues later on when you try to create distributed objects and assign values.

    We also ran your cluster.xml on our own cluster of machines and could start the workers successfully:
    > library(distributedR)
    Loading required package: Rcpp
    Loading required package: RInside
    > distributedR_start(cluster_conf='issue.xml')
    Workers registered - 2/2.
    All 2 workers are registered.
    Master address:port - node1:5555
    [1] TRUE
    > distributedR_status()
              Workers Inst SysMem MemUsed DarrayQuota DarrayUsed
    1 127.0.0.1:49761   31 112863    8950      101376          0
    2     node2:2020   31 112863    1527      101376          0
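
The 127.0.0.1 problem described above can be caught before starting the cluster. The following is only a minimal sketch, not part of distributedR: the helper names `is_loopback` and `loopback_workers` are made up for illustration, and the XML layout is assumed to match the cluster.xml posted in this thread.

```python
import ipaddress
import xml.etree.ElementTree as ET

# cluster.xml content as posted in this thread.
CLUSTER_XML = """
<MasterConfig>
  <ServerInfo>
    <Hostname>server18</Hostname>
    <Port>5555</Port>
  </ServerInfo>
  <Workers>
    <Worker>
      <Hostname>127.0.0.1</Hostname>
      <Port>0</Port>
    </Worker>
    <Worker>
      <Hostname>server20</Hostname>
      <Port>2020</Port>
    </Worker>
  </Workers>
</MasterConfig>
"""

def is_loopback(hostname):
    """True for 'localhost' or a loopback IP literal such as 127.0.0.1 or ::1."""
    if hostname.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(hostname).is_loopback
    except ValueError:
        # Not an IP literal (e.g. 'server20'); treated as a real host name.
        return False

def loopback_workers(xml_text):
    """Return worker hostnames that every node would resolve to itself."""
    root = ET.fromstring(xml_text)
    hosts = [(h.text or "").strip()
             for h in root.findall("./Workers/Worker/Hostname")]
    return [h for h in hosts if is_loopback(h)]

print(loopback_workers(CLUSTER_XML))  # ['127.0.0.1']
```

Any hostname this reports would be better replaced with the name the other nodes use for that machine (server18 in this thread).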


  • Thank you. Port 5555 wasn't open, and when I chose port 0 it still wouldn't work. After I opened port 5555, it worked.
  • Thank you Jesse.  Glad to know that it's working now.  Just curious about port 0 still not working.  Can you please paste the log for port 0?  Thanks.
  • I believe that even with port 0, the computer was selecting a port that was inaccessible to the worker nodes.


    Master Log
    cat /tmp/R_master_Jesse_server18.56089.log
    2014-Mar-18 16:57:56.726864 [INFOR] Master node is listening at 56089 port.
    2014-Mar-18 16:57:56.727454 [INFOR] Resource Manager Created
    2014-Mar-18 16:57:56.727490 [INFOR] Master Initialization done
    2014-Mar-18 16:57:57.772559 [INFOR] Worker 127.0.0.1:59587 registered. Number of executors: 1; Shared Memory Segment: 1794355200
    2014-Mar-18 16:57:58.050669 [INFOR] Master awaiting HELLO handshaking with Workers.
    2014-Mar-18 16:58:58.050779 [WARN] Only 1 workers are registered. Check with distributedR_status()
    2014-Mar-18 16:58:58.050905 [INFOR] Checking non-registered workers - server18 (null)
    2014-Mar-18 16:58:58.050941 [INFOR] Comparing with the registered worker - 127.0.0.1

    2014-Mar-18 16:58:58.050977 [INFOR] Master started.
    2014-Mar-18 16:58:58.051012 [INFOR] Checking non-registered workers - 127.0.0.1 (null)
    2014-Mar-18 16:58:58.051043 [INFOR] Comparing with the registered worker - 127.0.0.1

    2014-Mar-18 16:58:58.051076 [INFOR] Checking non-registered workers - server20 (null)
    2014-Mar-18 16:58:58.051106 [INFOR] Comparing with the registered worker - 127.0.0.1

    2014-Mar-18 16:58:58.051140 [INFOR] Master started.
    2014-Mar-18 16:59:19.153019 [INFOR] distributedR shutdown complete.

    Master Executor

    2014-Mar-18 16:57:57.760048 [INFOR] Executor started.
    2014-Mar-18 16:57:57.760395 [INFOR] Communication pipe with Worker opened on 17
    Loading required package: lattice
    Loading required package: Rcpp

    Attaching package: ‘MatrixHelper’

    The following object is masked from ‘package:Matrix’:

        unpack

    Loading required package: RInside

    Attaching package: ‘Executor’

    The following object is masked from ‘package:Matrix’:

        update

    The following object is masked from ‘package:stats’:

        update

    2014-Mar-18 16:58:00.601002 [INFOR] *** No Task under execution. Waiting from Task from Worker **



    Worker Log
    2014-Mar-18 16:57:09.520326 [INFOR] Starting worker.
    2014-Mar-18 16:57:09.523133 [INFOR] Creating Executors in Worker
    2014-Mar-18 16:57:09.523546 [INFOR] Created new Executor 0 with Process ID 3407
    2014-Mar-18 16:57:09.523898 [INFOR] Created HandleRequest threads to listen requests from Master
    2014-Mar-18 16:57:09.523955 [INFOR] Worker server20:0 with 1 executors and 1794487910 Shared Memory
    2014-Mar-18 16:57:09.531126 [INFOR] Creating a connection for handshake with master server18:56089
    2014-Mar-18 16:57:09.531283 [INFOR] Worker opened connection to Master at server18:56089
    2014-Mar-18 16:57:09.531354 [INFOR] Sending reply with worker info: server20 61596
    2014-Mar-18 16:57:09.531540 [INFOR] HELLO Handshaking reply sent to Master. Master server18:56089 registered with Worker
    2014-Mar-18 16:59:39.532101 [INFOR] Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150
    2014-Mar-18 16:59:39.532242 [INFOR] Worker Shutdown triggered.
    2014-Mar-18 16:59:39.532952 [INFOR] Worker shutdown - destroying executorpool
    2014-Mar-18 16:59:40.533198 [INFOR] Worker shutdown - Removing shared memory segments
    2014-Mar-18 16:59:40.533297 [INFOR] Worker shutdown - Removing sem lock : -1
    2014-Mar-18 16:59:40.556710 [INFOR] Worker shutdown - Closing connection to other workers
    2014-Mar-18 16:59:40.556955 [INFOR] Worker Shutdown complete.


    Worker Executor

    2014-Mar-18 16:57:09.550176 [INFOR] Executor started.
    2014-Mar-18 16:57:09.556907 [INFOR] Communication pipe with Worker opened on 17
    Loading required package: lattice
    Loading required package: Rcpp

    Attaching package: ‘MatrixHelper’

    The following object is masked from ‘package:Matrix’:

        unpack

    Loading required package: RInside

    Attaching package: ‘Executor’

    The following object is masked from ‘package:Matrix’:

        update

    The following object is masked from ‘package:stats’:

        update

    2014-Mar-18 16:57:12.622203 [INFOR] *** No Task under execution. Waiting from Task from Worker **
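
The master log above shows the master listening on 56089 when port 0 was configured. That matches ordinary socket behavior, where binding to port 0 asks the operating system to pick a free port. The sketch below shows only that general behavior; that distributedR maps port 0 the same way is an assumption based on the log, and `bind_ephemeral` is a made-up helper name.

```python
import socket

def bind_ephemeral():
    """Bind a TCP listener to port 0 and return (socket, port the OS picked)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # port 0 means "let the OS choose a free port"
    sock.listen(1)
    return sock, sock.getsockname()[1]

sock, port = bind_ephemeral()
print("OS-assigned port:", port)  # the number typically differs on each run
sock.close()
```

Because the number changes from run to run, a firewall that only allows specific ports will usually block it, which is consistent with port 0 not helping here.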






  • Thank you Jesse. Yes, it looks like the same issue: the dynamically chosen port is not open to the workers.
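
For anyone hitting the same symptom, one way to confirm a blocked master port is to try a plain TCP connection from the worker machine. This is a rough sketch rather than anything distributedR provides: `can_connect` is a hypothetical helper, and the self-check at the bottom only probes a throwaway listener on the local machine so the snippet runs anywhere.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (dropped packets) and DNS failures.
        return False

# Local self-check: listen on a throwaway port on this machine and probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
print(can_connect("127.0.0.1", listener.getsockname()[1]))  # True
listener.close()
```

On server20, the call of interest would be `can_connect("server18", 5555)`, using the master address from the logs in this thread. A result of False while the master log reports that it is listening would point to a firewall or similar filtering between the machines.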
