distributedR_start() problem

kimchitsigai · April 2014

Hello all,

I've installed the 0.5 version of DistributedR on CentOS 6.5.
I have one Master and four Workers installed as VMware VMs.

Starting DistributedR in Single Machine Mode works.

Starting DistributedR in Multiple Machine Mode fails.

The output of R is:
> distributedR_start(cluster_conf="/home/jin/dR/cluster.xml", log=3)
Workers registered - 0/4. Wait upto 60 seconds.
Shutdown complete
Error in value[[3L]](cond) : No workers are registered

The Master log is:
2014-Apr-13 10:37:13.029405 [INFOR] Master node is listening at 8989 port.
2014-Apr-13 10:37:13.029981 [INFOR] Resource Manager Created
2014-Apr-13 10:37:13.030012 [INFOR] Master Initialization done
2014-Apr-13 10:37:14.306944 [INFOR] Master awaiting HELLO handshaking with Workers.
2014-Apr-13 10:38:14.307571 [ERROR] No workers are registered
2014-Apr-13 10:38:14.310596 [DEBUG] Sending Shutdown message to Workers.
2014-Apr-13 10:38:15.315530 [DEBUG] Killed Master Message Handler
2014-Apr-13 10:38:15.315784 [DEBUG] Killed Scheduler
2014-Apr-13 10:38:15.315899 [INFOR] distributedR shutdown complete.

The Worker log is:
2014-Apr-13 10:37:14.119284 [INFOR] Starting worker.
2014-Apr-13 10:37:14.121764 [INFOR] Creating Executors in Worker
2014-Apr-13 10:37:14.122062 [INFOR] Created new Executor 0 with Process ID 2978
2014-Apr-13 10:37:14.122525 [INFOR] Created new Executor 1 with Process ID 2979
2014-Apr-13 10:37:14.123041 [INFOR] Created new Executor 2 with Process ID 2980
2014-Apr-13 10:37:14.123560 [INFOR] Created new Executor 3 with Process ID 2981
2014-Apr-13 10:37:14.129698 [INFOR] Created HandleRequest threads to listen requests from Master
2014-Apr-13 10:37:14.129752 [INFOR] Worker centos2.localdomain:9090 with 4 executors and 1804647628 Shared Memory
2014-Apr-13 10:37:14.176329 [INFOR] Creating a connection for handshake with master centos1:8989
2014-Apr-13 10:37:14.176521 [INFOR] Worker opened connection to Master at centos1:8989
2014-Apr-13 10:37:14.176618 [INFOR] Sending reply with worker info: centos2 9090
2014-Apr-13 10:37:14.176776 [INFOR] HELLO Handshaking reply sent to Master. Master centos1:8989 registered with Worker
2014-Apr-13 10:37:14.228034 [DEBUG] Connected to master at tcp://centos1:8989
2014-Apr-13 10:39:44.195226 [INFOR] Master node is detected to be down. Shutdown worker : elapsed time since last heartbeat: 150
2014-Apr-13 10:39:44.195568 [INFOR] Worker Shutdown triggered.
2014-Apr-13 10:39:44.195850 [DEBUG] Total MB fetched: 0.00 MB
Total fetch time: 0.00 s
Total MB sent: 0.00 MB
Total send time: 0.00 s
Total cc time: 0.00 s
2014-Apr-13 10:39:44.195931 [DEBUG] PrestoWorker shutdown - joining threads
2014-Apr-13 10:39:44.197261 [DEBUG] PrestoWorker shutdown - joining threads for 0:0
2014-Apr-13 10:39:44.197416 [DEBUG] PrestoWorker shutdown - joining threads for 0:1
2014-Apr-13 10:39:44.197485 [DEBUG] PrestoWorker shutdown - joining threads for 0:2
2014-Apr-13 10:39:44.197538 [DEBUG] PrestoWorker shutdown - joining threads for 0:3
2014-Apr-13 10:39:44.197842 [DEBUG] PrestoWorker shutdown - joining threads for 1:0
2014-Apr-13 10:39:44.198115 [DEBUG] PrestoWorker shutdown - joining threads for 2:0

The cluster.xml file is:
<MasterConfig>
  <ServerInfo>
    <Hostname>centos1</Hostname>
    <Port>8989</Port>
  </ServerInfo>
  <Workers>
    <Worker>
      <Hostname>centos2</Hostname>
      <Port>9090</Port>
      <Executors>4</Executors>
      <SharedMemory>0</SharedMemory>
    </Worker>
    <Worker>
      <Hostname>centos3</Hostname>
      <Port>9090</Port>
      <Executors>4</Executors>
      <SharedMemory>0</SharedMemory>
    </Worker>
    <Worker>
      <Hostname>centos4</Hostname>
      <Port>9090</Port>
      <Executors>4</Executors>
      <SharedMemory>0</SharedMemory>
    </Worker>
    <Worker>
      <Hostname>centos5</Hostname>
      <Port>9090</Port>
      <Executors>4</Executors>
      <SharedMemory>0</SharedMemory>
    </Worker>
  </Workers>
</MasterConfig>

Thank you for your help,
Jin

Nimmi_gupta · April 2014

Can you check whether all the machines in the cluster are set up for password-less and prompt-less login to one another? Also, each machine in the cluster should allow password-less and prompt-less login for the command “ssh 127.0.0.1”. If not, I would suggest enabling password-less and prompt-less login first and then trying to start distributedR on multiple nodes again.
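
One way to test this from a script is sketched below in Python. It is only an illustration, not part of distributedR: the helper names are made up, and the host list is an assumption taken from the cluster.xml posted above. It relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of asking for a password or passphrase, so a non-zero exit code means the login is not fully prompt-less.

```python
import subprocess


def build_ssh_check_command(host, timeout_s=5):
    """Build an ssh command that fails rather than prompting (hypothetical helper)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never ask for a password or passphrase
        "-o", f"ConnectTimeout={timeout_s}",   # give up quickly on unreachable hosts
        host,
        "true",                                # harmless remote command
    ]


def ssh_is_promptless(host, timeout_s=5):
    """Return True if `ssh host true` succeeds without any interaction."""
    try:
        result = subprocess.run(
            build_ssh_check_command(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Assumed host list, taken from the cluster.xml in this thread, plus loopback.
HOSTS = ["centos1", "centos2", "centos3", "centos4", "centos5", "127.0.0.1"]


def report(hosts=HOSTS):
    """Print one line per host; run this on every machine in the cluster."""
    for host in hosts:
        status = "ok" if ssh_is_promptless(host) else "FAILED"
        print(f"{host}: {status}")
```

Calling report() on each of the five machines in turn would cover every pair of hosts.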

Kyungyong_Lee · April 2014

In addition to Nimmi's suggestion, can you also check the firewall setup? At a minimum, the ports used by the master and the worker nodes must be allowed through the firewall.
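
A rough way to check this is sketched below in Python. It is not a distributedR tool, and the function names are hypothetical. It reads the master and worker host/port pairs from a cluster.xml with the layout posted above and tries a plain TCP connection to each. A connection can only succeed while a process is actually listening on the port, so the check is meaningful only during the registration window after distributedR_start() has been called.

```python
import socket
import xml.etree.ElementTree as ET

# Trimmed sample in the same layout as the cluster.xml posted in this thread.
SAMPLE_CLUSTER_XML = """
<MasterConfig>
  <ServerInfo><Hostname>centos1</Hostname><Port>8989</Port></ServerInfo>
  <Workers>
    <Worker><Hostname>centos2</Hostname><Port>9090</Port></Worker>
    <Worker><Hostname>centos3</Hostname><Port>9090</Port></Worker>
  </Workers>
</MasterConfig>
"""


def endpoints_from_cluster_xml(xml_text):
    """Return a list of (role, hostname, port) tuples found in the config."""
    root = ET.fromstring(xml_text)
    server = root.find("ServerInfo")
    endpoints = [
        ("master", server.findtext("Hostname").strip(), int(server.findtext("Port")))
    ]
    for worker in root.findall("./Workers/Worker"):
        endpoints.append(
            ("worker", worker.findtext("Hostname").strip(), int(worker.findtext("Port")))
        )
    return endpoints


def port_is_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_cluster(xml_text):
    """Print the reachability of every endpoint listed in the config."""
    for role, host, port in endpoints_from_cluster_xml(xml_text):
        state = "reachable" if port_is_reachable(host, port) else "NOT reachable"
        print(f"{role} {host}:{port} is {state}")
```

Running check_cluster() from a worker tests the path to the master port, and running it from the master tests the path to each worker port.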

kimchitsigai · April 2014

Hi Nimmi and Kyungyong,
Thank you for your answers.

I have promptless and passwordless ssh access between any pair of machines, for example from centos1 to centos2.
I have promptless and passwordless ssh access to 127.0.0.1.
I have turned off iptables on all 5 machines.

I've noticed in the logs that the worker node seems to send its HELLO handshake to the master before the master listens to it.

Best,
Jin

Kyungyong_Lee · April 2014

Thanks for checking that, Jin. The timestamps printed in the log files can disagree because the processes run on different machines, and the worker process is started through ssh only after confirming that the master is listening on its port.
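
To put a number on it, the two log lines quoted earlier in the thread can be compared directly. The Python sketch below is only an illustration; it assumes the timestamp layout seen in these logs and an English locale for the month abbreviation.

```python
from datetime import datetime

# Timestamp layout as it appears in the logs above, e.g. 2014-Apr-13 10:37:14.176776
LOG_TIME_FORMAT = "%Y-%b-%d %H:%M:%S.%f"


def parse_log_time(line):
    """Parse the timestamp made of the first two whitespace-separated fields."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)


# Lines copied from the worker and master logs in this thread.
worker_hello = ("2014-Apr-13 10:37:14.176776 [INFOR] HELLO Handshaking reply "
                "sent to Master. Master centos1:8989 registered with Worker")
master_wait = ("2014-Apr-13 10:37:14.306944 [INFOR] Master awaiting HELLO "
               "handshaking with Workers.")

gap = parse_log_time(master_wait) - parse_log_time(worker_hello)
print(f"apparent gap: {gap.total_seconds():.6f} s")
```

The apparent gap is about 0.13 seconds, which is small enough to come from clock differences between the two VMs.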
If iptables is disabled, can you please check the SELinux setup as well? Based on the log files, it looks as if the worker cannot make a connection to the master.
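
The current SELinux mode can be read with the standard getenforce command. The small Python wrapper below is a sketch only, with a made-up function name; it returns None on machines where the command is not available, and it would need to be run on the master and on every worker.

```python
import subprocess


def selinux_mode():
    """Return 'Enforcing', 'Permissive' or 'Disabled', or None if unknown."""
    try:
        result = subprocess.run(
            ["getenforce"], capture_output=True, text=True, timeout=5
        )
    except (OSError, subprocess.TimeoutExpired):
        # getenforce is not installed or did not answer in time.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


mode = selinux_mode()
print(f"SELinux mode: {mode if mode is not None else 'unknown'}")
```

If the mode is Enforcing, switching to Permissive for a test run would show whether SELinux is involved.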
