Database log

Hi,

A few days ago we suffered a fall down in our database and we are trying to find out how it happened.

We think the fall down occurred on the 11th of November 2014 at 03:41 AM. In the “veritca.log” log attached, on the 46443 row, it shows the command “shutdown”, but we don’t know if it refers to a server or a data base shut down. If it were the server, we would already have an answer, but if it were the database instead, we would need to know what was be cause. Based on the log attached, you think this could be the reason for it?

LOG attached: https://dl.dropboxusercontent.com/u/25310354/vertica.log


 We wait for your comments.

 Thanks in advanced.

 Best regards,


Comments

  • 2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> Saw membership message 6144 on V:abtlc
    2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> NETWORK change with 1 VS sets
    2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> DB Group changed
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Setting node v_abtlc_node0001 to UNSAFE

    As for why the node, left, there was no PANIC or ERRORS so it was not a vertica issue.

    The issue for this type of case has been with the networking as seen in this statement, <INFO> NETWORK change with 1 VS sets. Was there any network issue on your side. issues with network can be with packet loss/drops, unreachable network, etc
  • As per the the vertica.log provided by you of Node 01 "NETWORK change with 1 VS sets"  messages as can be seen below typically indicate that at the spread layer when it did a periodic check for group members it didn't get a response from one or more known members. This triggers it to drop those members from the group, which in Vertica means that the node(s) are no longer active parts of the data base so they shutdown.
    In majority of the cases this is attributed to/traced back to either a period of network latency, network saturation. Basically any resource situation where the node can't respond back to the broadcast or peer to peer spread message.So I will suggest you to kindly check this with your respective IT/network team & look if they  might have observed anything strange on the network side around the same time when DB/cluster went down.

    Snippet from Vertica.log:
    =========================

    2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> NETWORK change with 1 VS sets
    2014-11-11 03:41:34.711 Spread Client:0x91260d0 [Comms] <INFO> DB Group changed
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Setting node v_abtlc_node0001 to UNSAFE


    2014-11-11 03:41:34.714 Spread Client:0x91260d0 [VMPI] <INFO> DistCall: Set current group members called with 0 members
    2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0001 left the cluster
    2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
    2014-11-11 03:41:34.714 SafetyShutdown:0x7f35c000cf40 [Shutdown] <INFO> Shutting down this node
    2014-11-11 03:41:34.714 Spread Client:0x91260d0 [Comms] <INFO> Lost membership of the DB group



  • I think that the node v_abtlc_node0003  was down long ago and when  v_abtlc_node0002 has left the cluster (probably a network delay between node 2 and 1),  v_abtlc_node0001 (ksafety=1)  was leave alone... then has shutdown the database.


    2014-11-11 03:41:00.216 AnalyzeRowCount:0x7f35c400f840 <WARNING> @v_abtlc_node0001: 01000/4539: Received no response from v_abtlc_node0003 in GetRowCountMessage
    ...    2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Comms] <INFO> nodeSetNotifier: node v_abtlc_node0002 left the cluster  ...  2014-11-11 03:41:34.712 Spread Client:0x91260d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes    Ciao,   Massimo      

  • Checking in the server /var/log/messages might give an idea, whether it was a server issue and what it might have been.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file