Database log

Eqtconsulting · November 2014

Hi,

A few days ago we suffered a fall down in our database and we are trying to find out how it happened.

We think the fall down occurred on the 11th of November 2014 at 03:41 AM. In the “veritca.log” log attached, on the 46443 row, it shows the command “shutdown”, but we don’t know if it refers to a server or a data base shut down. If it were the server, we would already have an answer, but if it were the database instead, we would need to know what was be cause. Based on the log attached, you think this could be the reason for it?

LOG attached: https://dl.dropboxusercontent.com/u/25310354/vertica.log

We wait for your comments.

Thanks in advanced.

Best regards,

msaxena · November 2014

2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> Saw membership message 6144 on V:abtlc
2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> NETWORK change with 1 VS sets
2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> DB Group changed
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Setting node v_abtlc_node0001 to UNSAFE

As for why the node, left, there was no PANIC or ERRORS so it was not a vertica issue.

The issue for this type of case has been with the networking as seen in this statement, <INFO> NETWORK change with 1 VS sets. Was there any network issue on your side. issues with network can be with packet loss/drops, unreachable network, etc

Rahul_Choudhary · November 2014

As per the the vertica.log provided by you of Node 01 "NETWORK change with 1 VS sets" messages as can be seen below typically indicate that at the spread layer when it did a periodic check for group members it didn't get a response from one or more known members. This triggers it to drop those members from the group, which in Vertica means that the node(s) are no longer active parts of the data base so they shutdown.
In majority of the cases this is attributed to/traced back to either a period of network latency, network saturation. Basically any resource situation where the node can't respond back to the broadcast or peer to peer spread message.So I will suggest you to kindly check this with your respective IT/network team & look if they might have observed anything strange on the network side around the same time when DB/cluster went down.

Snippet from Vertica.log:
=========================

2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> NETWORK change with 1 VS sets
2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> DB Group changed
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Setting node v_abtlc_node0001 to UNSAFE

2014-11-11 03:41:34.714 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 0 members
2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0001 left the cluster
2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-11-11 03:41:34.714 SafetyShutdown:0x7f35c000cf40 [Shutdown] <INFO> Shutting down this node
2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Comms] <INFO> Lost membership of the DB group

Massimo_Loporch · November 2014

I think that the node v_abtlc_node0003 was down long ago and when v_abtlc_node0002 has left the cluster (probably a network delay between node 2 and 1), v_abtlc_node0001 (ksafety=1) was leave alone... then has shutdown the database.

2014-11-11 03:41:00.216 AnalyzeRowCount:0x7f35c400f840 <WARNING> @v_abtlc_node0001: 01000/4539: Received no response from v_abtlc_node0003 in GetRowCountMessage

...    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster  ...  2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes    Ciao,   Massimo

Raul_1 · November 2014

Checking in the server /var/log/messages might give an idea, whether it was a server issue and what it might have been.

Database log

Comments

Leave a Comment