How to delay Vertica node shutdown when k-safety assessment fails?
We're running a 3-node Vertica cluster. The network connection between the nodes sometimes fails for a short time (e.g., 30 seconds).
When this happens, every node shuts down as soon as it detects that the other nodes are unreachable (because k-safety can no longer be satisfied). For example, node0003 records the following sequence in its Vertica log:
00:04:30.633 node v_feedback_node0001 left the cluster
...
00:04:30.670 Node left cluster, reassessing k-safety...
...
00:04:32.389 node v_feedback_node0002 left the cluster
...
00:04:32.414 Changing node v_feedback_node0003 startup state from UP to UNSAFE
...
00:04:33.425 Shutting down this node
...
00:04:38.547 node v_feedback_node0003 left the cluster
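To put numbers on how quickly the shutdown follows the first lost node, the timestamps in the log excerpt above can be differenced. This is only a rough sketch in Python: the helper names `parse` and `seconds_between` are made up for illustration, and the log lines are copied from the excerpt with the elided parts left out.

```python
from datetime import datetime

# Log excerpt copied from the question (elided "..." lines omitted).
LOG = """\
00:04:30.633 node v_feedback_node0001 left the cluster
00:04:30.670 Node left cluster, reassessing k-safety...
00:04:32.389 node v_feedback_node0002 left the cluster
00:04:32.414 Changing node v_feedback_node0003 startup state from UP to UNSAFE
00:04:33.425 Shutting down this node
00:04:38.547 node v_feedback_node0003 left the cluster
"""

def parse(log):
    """Return a list of (timestamp, message) pairs, one per non-empty line."""
    events = []
    for line in log.splitlines():
        if not line.strip():
            continue
        stamp, message = line.split(" ", 1)
        events.append((datetime.strptime(stamp, "%H:%M:%S.%f"), message))
    return events

def seconds_between(events, start_text, end_text):
    """Seconds from the first event containing start_text to the first containing end_text."""
    start = next(t for t, m in events if start_text in m)
    end = next(t for t, m in events if end_text in m)
    return (end - start).total_seconds()

events = parse(LOG)
first_loss_to_shutdown = seconds_between(events, "node0001 left", "Shutting down")
first_loss_to_exit = seconds_between(events, "node0001 left", "node0003 left")
print(f"{first_loss_to_shutdown:.3f} s until shutdown starts")
print(f"{first_loss_to_exit:.3f} s until node0003 leaves the cluster")
```

For this excerpt, that works out to roughly 2.8 seconds from the first lost node to "Shutting down this node", and roughly 7.9 seconds until node0003 itself leaves the cluster.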
Does anyone know whether it's possible to configure a delay during which each node keeps trying to reconnect to the others before giving up and shutting down?
Comments
This time is hard-coded to 8 seconds.
I think time is better spent making the network more reliable. 30 seconds of network failure is a lot (I mean really, really large; typical network RTT is in the microseconds). Even if you kept Vertica up by delaying the k-safety assessment, nothing could really connect to the database during the outage, and most likely all DB connections would be reset.
What kind of network is in use?
Thanks a lot for your answer Skeswani.
Our Vertica servers are hosted on a VMware infrastructure (I know it is not recommended). From time to time, a node "freezes" because the VM that hosts it is moved from one physical host to another. That is what causes the 30-second network failure.
I'll follow your advice and check with our admins whether they can do something about it.
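One way to confirm when and for how long a VM actually freezes is to sample the clock at a short fixed interval inside the guest and flag any gap much larger than that interval. The Python sketch below is only an illustration: `find_stalls` and `sample_clock` are hypothetical helper names, the interval and tolerance values are arbitrary, and whether a freeze shows up as a wall-clock gap depends on how the hypervisor synchronizes guest time, so treat that as an assumption to verify.

```python
import time

def find_stalls(samples, interval, tolerance):
    """Return (start, gap) pairs where consecutive samples are more than
    interval + tolerance seconds apart."""
    stalls = []
    for prev, cur in zip(samples, samples[1:]):
        gap = cur - prev
        if gap > interval + tolerance:
            stalls.append((prev, gap))
    return stalls

def sample_clock(count, interval):
    """Collect `count` wall-clock readings, sleeping `interval` seconds between them."""
    readings = []
    for _ in range(count):
        readings.append(time.time())
        time.sleep(interval)
    return readings

if __name__ == "__main__":
    # Short demo run; a real check would sample for much longer (assumption).
    readings = sample_clock(count=20, interval=0.05)
    for start, gap in find_stalls(readings, interval=0.05, tolerance=0.5):
        print(f"stall of {gap:.1f} s starting at {time.ctime(start)}")
```

Comparing the reported stall times with the "left the cluster" timestamps in the Vertica log would show whether the node departures line up with VM migrations.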
This is exactly why we recommend against vMotion. I don't think this is a network problem, and network admins may not be able to help.
Do you have the ability to take an outage when migrating nodes?