Worker is down because it says Master node is detected to be down

In my R interface, this result generates:
"A worker (192.168.100.24:2020) is down"

When looking at the Worker log though we get:
"Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150"

I have made sure to synchronize the clocks, but otherwise am at a loss.


*****************WORKER LOG**************
2014-May-06 13:34:34.355867 [INFOR] Starting worker.
2014-May-06 13:34:34.357904 [INFOR] Creating Executors in Worker
2014-May-06 13:34:34.358200 [INFOR] Created new Executor 0 with Process ID 19459
2014-May-06 13:34:34.358432 [INFOR] Created new Executor 1 with Process ID 19460
2014-May-06 13:34:34.358727 [INFOR] Created HandleRequest threads to listen requests from Master
2014-May-06 13:34:34.358776 [INFOR] Worker server24:2020 with 2 executors and 1794495283 Shared Memory
2014-May-06 13:34:34.366709 [INFOR] Creating a connection for handshake with master 192.168.100.18:5555
2014-May-06 13:34:34.366863 [INFOR] Worker opened connection to Master at 192.168.100.18:5555
2014-May-06 13:34:34.366927 [INFOR] Sending reply with worker info: 192.168.100.24 2020
2014-May-06 13:34:34.367092 [INFOR] HELLO Handshaking reply sent to Master. Master 192.168.100.18:5555 registered with Worker
2014-May-06 13:37:04.367667 [INFOR] Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150
2014-May-06 13:37:04.367816 [INFOR] Worker Shutdown triggered.
2014-May-06 13:37:04.368557 [INFOR] Worker shutdown - destroying executorpool
2014-May-06 13:37:05.368795 [INFOR] Worker shutdown - Removing shared memory segments
2014-May-06 13:37:05.368899 [INFOR] Worker shutdown - Removing sem lock : -1
2014-May-06 13:37:05.386797 [INFOR] Worker shutdown - Closing connection to other workers
2014-May-06 13:37:05.387077 [INFOR] Worker Shutdown complete.


*****************MASTER LOG***************
2014-May-06 13:34:33.283525 [INFOR] Master node is listening at 5555 port.
2014-May-06 13:34:33.284014 [INFOR] Resource Manager Created
2014-May-06 13:34:33.284047 [INFOR] Master Initialization done
2014-May-06 13:34:34.316574 [INFOR] Worker 192.168.100.19:9090 registered. Number of executors: 2; Shared Memory Segment: 1794347827
2014-May-06 13:34:34.328546 [INFOR] Worker 192.168.100.20:2020 registered. Number of executors: 2; Shared Memory Segment: 1794421555
2014-May-06 13:34:34.342760 [INFOR] Worker 192.168.100.21:2020 registered. Number of executors: 2; Shared Memory Segment: 1794495283
2014-May-06 13:34:34.358314 [INFOR] Worker 192.168.100.23:2020 registered. Number of executors: 2; Shared Memory Segment: 1794495283
2014-May-06 13:34:34.368039 [INFOR] Worker 192.168.100.24:2020 registered. Number of executors: 2; Shared Memory Segment: 1794495283
2014-May-06 13:34:34.383680 [INFOR] Worker 192.168.100.22:2020 registered. Number of executors: 2; Shared Memory Segment: 1794495283
2014-May-06 13:34:34.630371 [INFOR] Master awaiting HELLO handshaking with Workers.
2014-May-06 13:34:34.630534 [INFOR] All 6 workers are registered. Master Started.
2014-May-06 13:34:41.352571 [INFOR] *** A new Foreach function received. 36 parallel EXECUTE tasks will be created and sent to Worker :
function() {
    dhs <- array(data = as.numeric(x), dim = dobject.getdims(fulldim, blockdim, ii))
    update(dhs)
}
2014-May-06 13:34:41.352666 [INFOR] Foreach arguments has no dependencies. Sending task to Worker for execution
2014-May-06 13:34:41.352750 [INFOR] EXECUTE TaskID              3 - Sent to Worker 192.168.100.19
2014-May-06 13:34:41.352849 [INFOR] EXECUTE TaskID              5 - Sent to Worker 192.168.100.20
2014-May-06 13:34:41.352934 [INFOR] EXECUTE TaskID              7 - Sent to Worker 192.168.100.21
2014-May-06 13:34:41.352984 [INFOR] EXECUTE TaskID              9 - Sent to Worker 192.168.100.23
2014-May-06 13:34:41.353066 [INFOR] EXECUTE TaskID             11 - Sent to Worker 192.168.100.24
2014-May-06 13:34:41.353142 [INFOR] EXECUTE TaskID             13 - Sent to Worker 192.168.100.22
2014-May-06 13:34:41.353191 [INFOR] EXECUTE TaskID             15 - Sent to Worker 192.168.100.19
2014-May-06 13:34:41.353239 [INFOR] EXECUTE TaskID             17 - Sent to Worker 192.168.100.20
2014-May-06 13:34:41.353291 [INFOR] EXECUTE TaskID             19 - Sent to Worker 192.168.100.21
2014-May-06 13:34:41.353363 [INFOR] EXECUTE TaskID             21 - Sent to Worker 192.168.100.23
2014-May-06 13:34:41.353428 [INFOR] EXECUTE TaskID             23 - Sent to Worker 192.168.100.24
2014-May-06 13:34:41.353497 [INFOR] EXECUTE TaskID             25 - Sent to Worker 192.168.100.22
2014-May-06 13:34:41.353544 [INFOR] EXECUTE TaskID             27 - Sent to Worker 192.168.100.19
2014-May-06 13:34:41.353593 [INFOR] EXECUTE TaskID             29 - Sent to Worker 192.168.100.20
2014-May-06 13:34:41.353637 [INFOR] EXECUTE TaskID             31 - Sent to Worker 192.168.100.21
2014-May-06 13:34:41.353685 [INFOR] EXECUTE TaskID             33 - Sent to Worker 192.168.100.23
2014-May-06 13:34:41.353730 [INFOR] EXECUTE TaskID             35 - Sent to Worker 192.168.100.24
2014-May-06 13:34:41.353774 [INFOR] EXECUTE TaskID             37 - Sent to Worker 192.168.100.22
2014-May-06 13:34:41.353818 [INFOR] EXECUTE TaskID             39 - Sent to Worker 192.168.100.19
2014-May-06 13:34:41.353863 [INFOR] EXECUTE TaskID             41 - Sent to Worker 192.168.100.20
2014-May-06 13:34:41.353908 [INFOR] EXECUTE TaskID             43 - Sent to Worker 192.168.100.21
2014-May-06 13:34:41.353951 [INFOR] EXECUTE TaskID             45 - Sent to Worker 192.168.100.23
2014-May-06 13:34:41.354010 [INFOR] EXECUTE TaskID             47 - Sent to Worker 192.168.100.24
2014-May-06 13:34:41.354055 [INFOR] EXECUTE TaskID             49 - Sent to Worker 192.168.100.22
2014-May-06 13:34:41.354095 [INFOR] EXECUTE TaskID             51 - Sent to Worker 192.168.100.19
2014-May-06 13:34:41.354136 [INFOR] EXECUTE TaskID             53 - Sent to Worker 192.168.100.20
2014-May-06 13:34:41.354176 [INFOR] EXECUTE TaskID             55 - Sent to Worker 192.168.100.21
2014-May-06 13:34:41.354226 [INFOR] EXECUTE TaskID             57 - Sent to Worker 192.168.100.23
2014-May-06 13:34:41.354267 [INFOR] EXECUTE TaskID             59 - Sent to Worker 192.168.100.24
2014-May-06 13:34:41.354307 [INFOR] EXECUTE TaskID             61 - Sent to Worker 192.168.100.22
2014-May-06 13:34:41.354346 [INFOR] EXECUTE TaskID             63 - Sent to Worker 192.168.100.19
2014-May-06 13:34:41.354386 [INFOR] EXECUTE TaskID             65 - Sent to Worker 192.168.100.20
2014-May-06 13:34:41.354430 [INFOR] EXECUTE TaskID             67 - Sent to Worker 192.168.100.21
2014-May-06 13:34:41.354471 [INFOR] EXECUTE TaskID             69 - Sent to Worker 192.168.100.23
2014-May-06 13:34:41.354511 [INFOR] EXECUTE TaskID             71 - Sent to Worker 192.168.100.24
2014-May-06 13:34:41.354550 [INFOR] EXECUTE TaskID             73 - Sent to Worker 192.168.100.22
2014-May-06 13:36:34.631306 [ERROR] ResourceManager shutting down - 192.168.100.24 age is 120263129
2014-May-06 13:36:34.631436 [ERROR] ResourceManager shutting down - 192.168.100.23 age is 120273023
2014-May-06 13:36:34.631477 [ERROR] ResourceManager shutting down - 192.168.100.20 age is 120302833
2014-May-06 13:36:34.631511 [ERROR] ResourceManager shutting down - 192.168.100.21 age is 120288655
2014-May-06 13:36:34.631545 [ERROR] ResourceManager shutting down - 192.168.100.19 age is 120314830
2014-May-06 13:36:34.631578 [ERROR] ResourceManager shutting down - 192.168.100.22 age is 120247804
2014-May-06 13:36:34.631630 [WARN] A worker (192.168.100.24:2020) is down
2014-May-06 13:36:34.631666 [ERROR] distributedR will shutdown
2014-May-06 13:36:35.633484 [INFOR] distributedR shutdown complete.



Comments

  • Hi Jesse,

    The log file says that the initial handshake message was successful, however for some reason, the master and workers fail to communicate after the handshake. To see the log messages in-detail, can you please run the same commands  with the DEBUG log level? You can call distributedR_start(..., log=3) to enable more logging.
    In addition to that, can you please check if this problem happens when a distributedR cluster is formed with a single worker node? For example, the master address is set as 127.0.0.1 and a worker address is 127.0.0.1 with a different port number.
  • The problem does not occur with a single worker node when the worker and master are on the same node.

    My only guess is that the Master port is tcp enabled and not udp. Although that is a stab in the dark.

    Here is the log (this time node 23 shut down instead of 24):

    2014-May-08 09:22:53.144370 [INFOR] Starting worker.
    2014-May-08 09:22:53.151684 [INFOR] Creating Executors in Worker
    2014-May-08 09:22:53.152014 [INFOR] Created new Executor 0 with Process ID 24344
    2014-May-08 09:22:53.152244 [INFOR] Created new Executor 1 with Process ID 24345
    2014-May-08 09:22:53.152531 [INFOR] Created HandleRequest threads to listen requests from Master
    2014-May-08 09:22:53.152581 [INFOR] Worker server23:2020 with 2 executors and 1794495283 Shared Memory
    2014-May-08 09:22:53.163959 [INFOR] Creating a connection for handshake with master 192.168.100.18:5555
    2014-May-08 09:22:53.164116 [INFOR] Worker opened connection to Master at 192.168.100.18:5555
    2014-May-08 09:22:53.164182 [INFOR] Sending reply with worker info: 192.168.100.23 2020
    2014-May-08 09:22:53.164347 [INFOR] HELLO Handshaking reply sent to Master. Master 192.168.100.18:5555 registered with Worker
    2014-May-08 09:22:53.190934 [DEBUG] Connected to master at tcp://192.168.100.18:5555
    2014-May-08 09:25:23.170804 [INFOR] Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150
    2014-May-08 09:25:23.170945 [INFOR] Worker Shutdown triggered.
    2014-May-08 09:25:23.171033 [DEBUG] Total MB fetched:    0.00 MB
    Total fetch time:    0.00 s
    Total MB   sent:    0.00 MB
    Total send time:    0.00 s
    Total cc time:    0.00 s
    2014-May-08 09:25:23.171071 [DEBUG] PrestoWorker shutdown - joining threads
    2014-May-08 09:25:23.171114 [DEBUG] PrestoWorker shutdown - joining threads for 0:0
    2014-May-08 09:25:23.171326 [DEBUG] PrestoWorker shutdown - joining threads for 0:1
    2014-May-08 09:25:23.171406 [DEBUG] PrestoWorker shutdown - joining threads for 1:0
    2014-May-08 09:25:23.171451 [DEBUG] PrestoWorker shutdown - joining threads for 2:0
    2014-May-08 09:25:23.171551 [DEBUG] PrestoWorker shutdown - joining threads for 2:1
    2014-May-08 09:25:23.171605 [DEBUG] PrestoWorker shutdown - joining threads for 2:2
    2014-May-08 09:25:23.171652 [DEBUG] PrestoWorker shutdown - joining threads for 2:3
    2014-May-08 09:25:23.171703 [DEBUG] PrestoWorker shutdown - joining threads for 3:0
    2014-May-08 09:25:23.171788 [DEBUG] PrestoWorker shutdown - joining threads for 3:1
    2014-May-08 09:25:23.171839 [DEBUG] PrestoWorker shutdown - joining threads for 3:2
    2014-May-08 09:25:23.171886 [DEBUG] PrestoWorker shutdown - joining threads for 3:3
    2014-May-08 09:25:23.171940 [DEBUG] PrestoWorker shutdown - joining threads for 4:0
    2014-May-08 09:25:23.171993 [DEBUG] PrestoWorker shutdown - joining threads for 5:0
    2014-May-08 09:25:23.172043 [DEBUG] PrestoWorker shutdown - joining threads for 5:1
    2014-May-08 09:25:23.172089 [INFOR] Worker shutdown - destroying executorpool
    2014-May-08 09:25:23.172140 [DEBUG] Executorpool destructor: Waiting for child processes to join
    2014-May-08 09:25:24.172292 [DEBUG] ExecutorPool desctructor: All child processes are terminated
    2014-May-08 09:25:24.172408 [DEBUG] Executorpool destructor: Closing pipe descriptor
    2014-May-08 09:25:24.172474 [DEBUG] Executorpool destructor: Closing semaphores
    2014-May-08 09:25:24.172511 [DEBUG] Executorpool destructor: Removing shared memory segments
    2014-May-08 09:25:24.172546 [INFOR] Worker shutdown - Removing shared memory segments
    2014-May-08 09:25:24.172581 [INFOR] Worker shutdown - Removing sem lock : -1
    2014-May-08 09:25:24.313199 [INFOR] Worker shutdown - Closing connection to other workers
    2014-May-08 09:25:24.313460 [INFOR] Worker Shutdown complete.



  • Thanks Jesse. We are not using UDP messages, but if possible can you please disable firewall and rerun them? For the log message, it would be nicer if you also paste a master log messages.
  • Disabling the firewall solve the issue. Thanks!

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file