Replication to DR stops when 1 of 3 nodes goes down
We have a 3-node cluster at the primary site and a 3-node cluster at the disaster recovery (DR) site. Replication with the VBR utility normally worked fine. Lately we have observed that node 2 in the DR cluster goes down at random, and replication then stops even though the other two nodes are up and running.
For some reason, Node2 is losing its connection with Node1 and Node3.
For example, around 04:37:36 on June 12th:
Node1: 2020-06-12 04:37:36.599 Spread Client:7ffa697fb700 [Comms] NETWORK change with 1 VS sets
Node3: 2020-06-12 04:37:36.603 Spread Client:7fa2717fb700 [Comms] NETWORK change with 1 VS sets
Node3: 2020-06-12 04:37:36.603 Spread Client:7fa2717fb700 [Comms] nodeSetNotifier: node v_ossa_node0002 left the cluster
Node2: 2020-06-12 04:37:36.786 EEThread:7f24cf5fe700-a00000095f920a [EE] Recv: Message receipt from v_ossa_node0001 failed  handle=MultiplexedRecvHandle (0x7f2604008560) (10.126.217.59:5434)tag 1001 cancelId 0xa00000095fa806 CANCELED
Node1: 2020-06-12 04:37:36.601 Spread Client:7ffa697fb700 [Comms] nodeSetNotifier: node v_ossa_node0002 left the cluster
Node2: 2020-06-12 04:37:47.989 Spread Client:7f258f902700 [Comms] nodeSetNotifier: node v_ossa_node0001 left the cluster
Node2: 2020-06-12 04:37:48.029 Spread Client:7f258f902700 [Comms] nodeSetNotifier: node v_ossa_node0003 left the cluster
Node2: 2020-06-12 04:37:48.029 Spread Client:7f258f902700 [Recover] Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
Node2: 2020-06-12 04:37:48.029 Spread Client:7f258f902700 [Recover] Setting node v_ossa_node0002 to UNSAFE
The message “NETWORK change with 1 VS sets” means that the network group membership changed, probably because Node1 and Node3 did not receive the membership message from Node2.
As a consequence, Node1 and Node3 remove Node2 from the cluster, and similarly Node2 removes Node1 and Node3 from “its view” of the cluster. Because Node2 is then ‘alone’, it no longer has the quorum of 2, so it declares itself UNSAFE and shuts down.
This is normal Vertica behavior (required to preserve data integrity) and cannot be considered a bug.
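To make the quorum reasoning concrete, here is a minimal sketch in Python. It assumes a plain strict-majority rule (more than half of the nodes must be visible); the function name `has_quorum` is made up for illustration, and this is not Vertica's actual implementation, which also has to consider whether all data is still available on the surviving nodes.

```python
def has_quorum(total_nodes: int, up_nodes: int) -> bool:
    """Return True if the visible nodes form a strict majority.

    Simplified model only (an assumption for illustration): a view of the
    cluster is allowed to keep running when more than half of the nodes
    are in it.
    """
    return 2 * up_nodes > total_nodes


# The two views of the 3-node DR cluster after the split seen in the logs.
views = {
    "Node1 + Node3": has_quorum(total_nodes=3, up_nodes=2),
    "Node2 alone": has_quorum(total_nodes=3, up_nodes=1),
}

for name, ok in views.items():
    print(f"{name}: {'keeps running' if ok else 'no quorum, shuts down'}")
```

Under this model the side with two nodes keeps its majority while the isolated node does not, which matches the “3 total nodes, 1 up nodes, 2 down nodes” line that Node2 logged before setting itself to UNSAFE.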
So we need to understand why node2 cannot communicate with node1 and node3.
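As a starting point for that investigation, the “left the cluster” lines from all three nodes can be merged into one timeline. The sketch below is only an illustration: it uses four lines copied from the excerpt above, treats the `NodeN:` prefix as the poster's annotation rather than part of vertica.log, and the helper name `parse_departures` is made up.

```python
import re
from datetime import datetime

# Lines copied from the excerpt above; "NodeN:" is the poster's annotation.
LOG_LINES = [
    "Node3: 2020-06-12 04:37:36.603 Spread Client:7fa2717fb700 [Comms] nodeSetNotifier: node v_ossa_node0002 left the cluster",
    "Node1: 2020-06-12 04:37:36.601 Spread Client:7ffa697fb700 [Comms] nodeSetNotifier: node v_ossa_node0002 left the cluster",
    "Node2: 2020-06-12 04:37:47.989 Spread Client:7f258f902700 [Comms] nodeSetNotifier: node v_ossa_node0001 left the cluster",
    "Node2: 2020-06-12 04:37:48.029 Spread Client:7f258f902700 [Comms] nodeSetNotifier: node v_ossa_node0003 left the cluster",
]

PATTERN = re.compile(
    r"^(?P<reporter>\w+): "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"node (?P<left>\S+) left the cluster$"
)


def parse_departures(lines):
    """Return (timestamp, reporting node, departed node) tuples, oldest first."""
    events = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group("reporter"), match.group("left")))
    return sorted(events)


events = parse_departures(LOG_LINES)
for ts, reporter, left in events:
    stamp = ts.isoformat(sep=" ", timespec="milliseconds")
    print(f"{stamp}  {reporter} saw {left} leave")

# How long after its peers did Node2 notice the split?
peers_first = min(ts for ts, reporter, _ in events if reporter != "Node2")
node2_first = min(ts for ts, reporter, _ in events if reporter == "Node2")
lag = (node2_first - peers_first).total_seconds()
print(f"Node2 noticed the split {lag:.3f} s after its peers")
```

On these four lines the gap between the peers dropping Node2 and Node2 dropping its peers is about 11.4 seconds. That figure is just the difference between log timestamps, so it is only indicative if the VM clocks are not closely synchronized.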
As for a workaround, could someone explain how to configure the cluster at the DR site so that replication does not stop when node 2 is down? Ideally the snapshots would be copied to the remaining two nodes (1 and 3), and node 2 would be recovered when it comes back up. The safety setting configured at the DR site can be checked with the query below.
SELECT current_fault_tolerance FROM system;
Looking forward to your response.
Could you please help us with the details below:
1) Vertica Version
2) OS Version
3) Is Vertica running on physical machines or VMs?
In general, “NETWORK change with 1 VS sets” indicates that, due to resource saturation, Spread on Vertica node2 was not able to communicate with the rest of the nodes within the maximum time limit of 8 seconds.
Dear Mohit, thank you for your response. Please find the requested information below:
Vertica Version: 8.1
OS: RHEL server 6.1
Vertica is running on VMs (VMware).
Also, is there any way to keep replication going, for example by setting fault tolerance to 1, or some other way?
I hope this helps.