Database log
Hi,
A few days ago we suffered a fall down in our database and we are trying to find out how it happened.
We think the fall down occurred on the 11th of November 2014 at 03:41 AM. In the “veritca.log” log attached, on the 46443 row, it shows the command “shutdown”, but we don’t know if it refers to a server or a data base shut down. If it were the server, we would already have an answer, but if it were the database instead, we would need to know what was be cause. Based on the log attached, you think this could be the reason for it?
LOG attached: https://dl.dropboxusercontent.com/u/25310354/vertica.log
We wait for your comments.
Thanks in advanced.
Best regards,0
Comments
2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> NETWORK change with 1 VS sets
2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> DB Group changed
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Setting node v_abtlc_node0001 to UNSAFE
As for why the node, left, there was no PANIC or ERRORS so it was not a vertica issue.
The issue for this type of case has been with the networking as seen in this statement, <INFO> NETWORK change with 1 VS sets. Was there any network issue on your side. issues with network can be with packet loss/drops, unreachable network, etc
In majority of the cases this is attributed to/traced back to either a period of network latency, network saturation. Basically any resource situation where the node can't respond back to the broadcast or peer to peer spread message.So I will suggest you to kindly check this with your respective IT/network team & look if they might have observed anything strange on the network side around the same time when DB/cluster went down.
Snippet from Vertica.log:
=========================
2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> NETWORK change with 1 VS sets
2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> DB Group changed
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Setting node v_abtlc_node0001 to UNSAFE
2014-11-11 03:41:34.714 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 0 members
2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0001 left the cluster
2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-11-11 03:41:34.714 SafetyShutdown:0x7f35c000cf40 [Shutdown] <INFO> Shutting down this node
2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Comms] <INFO> Lost membership of the DB group
I think that the node v_abtlc_node0003 was down long ago and when v_abtlc_node0002 has left the cluster (probably a network delay between node 2 and 1), v_abtlc_node0001 (ksafety=1) was leave alone... then has shutdown the database.
2014-11-11 03:41:00.216 AnalyzeRowCount:0x7f35c400f840 <WARNING> @v_abtlc_node0001: 01000/4539: Received no response from v_abtlc_node0003 in GetRowCountMessage
Checking in the server /var/log/messages might give an idea, whether it was a server issue and what it might have been.