Replication To DR Stops when 1 out of 3 nodes goes down

veerkumar Vertica Customer

We have a 3-node cluster at the primary site and a 3-node cluster at the disaster recovery (DR) site. Replication via the vbr utility normally works fine. Lately, node 2 in the DR cluster has been going down at random, which stops replication even though the other two nodes are up and running.

For some reason, Node2 is losing its connection to Node1 and Node3.
For example, around 04:37:36 on June 12th:
Node1: 2020-06-12 04:37:36.599 Spread Client:7ffa697fb700 [Comms] NETWORK change with 1 VS sets
Node3: 2020-06-12 04:37:36.603 Spread Client:7fa2717fb700 [Comms] NETWORK change with 1 VS sets
Node3: 2020-06-12 04:37:36.603 Spread Client:7fa2717fb700 [Comms] nodeSetNotifier: node v_ossa_node0002 left the cluster
Node2: 2020-06-12 04:37:36.786 EEThread:7f24cf5fe700-a00000095f920a [EE] Recv: Message receipt from v_ossa_node0001 failed [] handle=MultiplexedRecvHandle (0x7f2604008560) (10.126.217.59:5434)tag 1001 cancelId 0xa00000095fa806 CANCELED
Node1: 2020-06-12 04:37:36.601 Spread Client:7ffa697fb700 [Comms] nodeSetNotifier: node v_ossa_node0002 left the cluster
Node2: 2020-06-12 04:37:47.989 Spread Client:7f258f902700 [Comms] nodeSetNotifier: node v_ossa_node0001 left the cluster
Node2: 2020-06-12 04:37:48.029 Spread Client:7f258f902700 [Comms] nodeSetNotifier: node v_ossa_node0003 left the cluster
Node2: 2020-06-12 04:37:48.029 Spread Client:7f258f902700 [Recover] Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
Node2: 2020-06-12 04:37:48.029 Spread Client:7f258f902700 [Recover] Setting node v_ossa_node0002 to UNSAFE
The message "NETWORK change with 1 VS sets" means that the network group changed, probably because Node1 and Node3 did not receive the membership message from Node2.
As a consequence, Node1 and Node3 remove Node2 from the cluster, and Node2 similarly removes Node1 and Node3 from its view of the cluster. Since Node2 is then alone, the quorum of 2 nodes is not met, so it declares itself UNSAFE and shuts down.
This is normal Vertica behavior (required to preserve data integrity) and cannot be considered a bug.
So we need to understand why Node2 cannot communicate with Node1 and Node3.
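A quick way to confirm which node has dropped out of the cluster view is to query the nodes system table from one of the surviving nodes; this is only a minimal sketch, assuming the standard node_name and node_state columns of v_catalog.nodes:

-- Run from Node1 or Node3 during the outage; the affected node should show as DOWN
SELECT node_name, node_state
FROM nodes
ORDER BY node_name;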

As for a workaround: could someone advise how to configure the DR cluster so that replication does not stop, snapshots are still copied to the remaining two nodes (1 and 3), and node 2 is recovered when it comes back up? The safety settings currently configured at the DR site are shown below.
SELECT get_design_ksafe();
 1
SELECT current_fault_tolerance FROM system;
 0

Looking forward to a response.
Regards,
Veer

Answers

  • saxenam Vertica Employee

    Hi Veer,
    Can you please help us with the details below:
    1) Vertica version
    2) OS version
    3) Is Vertica running on physical machines or VMs?
    In general, "NETWORK change with 1 VS sets" indicates that, due to resource saturation, spread on Vertica node2 is not able to communicate with the rest of the nodes within the maximum time limit of 8 seconds (a resource check is sketched at the end of this thread).
    Regards,
    Mohit Saxena

  • veerkumar Vertica Customer

    Dear Mohit, thank you for the response. Please find the required info below:
    Vertica version: 8.1
    OS: RHEL Server 6.1
    Vertica is running on VMs (VMware).
    Also, is there any way to keep replication going, e.g. by setting the fault tolerance to 1, or some other way?
    I hope this helps.
    Regards,
    Veer
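Regarding the resource saturation mentioned above, a minimal sketch of a per-host resource check, assuming the standard v_monitor.host_resources table (column names may differ on 8.1):

-- Look for memory, swap, or file-handle pressure on the host running node2
SELECT host_name,
       total_memory_bytes,
       total_memory_free_bytes,
       total_swap_memory_free_bytes,
       opened_file_count
FROM host_resources;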
