Spread Client received SIGSEGV. Did we hit a bug?
One of our clusters went down yesterday and I've been troubleshooting why. The cluster is a 3-node cluster running "Vertica Analytic Database v6.1.2-0".
Here are snippets of the relevant logs from each of the nodes in the cluster.
Node 1
2014-05-21 13:35:07.691 Spread Client:0x8b75400 [Comms] <INFO> nodeSetNotifier: node v_montreal_node0002 left the cluster
2014-05-21 13:35:07.691 Spread Client:0x8b75400 [Recover] <INFO> Node left cluster, reassessing k-safety...
...
2014-05-21 13:35:07.740 Init Session:0x7fa97e2206e0-a0000000abf7a8 [Dist] <WARNING> Node v_montreal_node0002 was in plan recipents list at plan start but is now DOWN
2014-05-21 13:35:07.763 Spread Client:0x8b75400 <LOG> @v_montreal_node0001: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2014-05-21 13:35:07.692123 ExpirationTimestamp: 2082-06-08 16:49:14.692123 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_montreal_node0001 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: montreal Hostname: ip-10-0-4-1
2014-05-21 13:35:07.764 Spread Client:0x8b75400 [Comms] <INFO> nodeSetNotifier: node v_montreal_node0003 left the cluster
2014-05-21 13:35:07.764 Spread Client:0x8b75400 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-05-21 13:35:07.764 Spread Client:0x8b75400 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2014-05-21 13:35:07.764 Spread Client:0x8b75400 [Recover] <INFO> Setting node v_montreal_node0001 to UNSAFE
2014-05-21 13:35:07.774 Spread Client:0x8b75400 <LOG> @v_montreal_node0001: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2014-05-21 13:35:07.77451 ExpirationTimestamp: 2082-06-08 16:49:14.77451 EventCodeDescription: Node State Change ProblemDescription: Changing node v_montreal_node0001 startup state to UNSAFE DatabaseName: montreal Hostname: ip-10-0-4-1
2014-05-21 13:35:21.426 unknown:0x7fa9b4f78700 [Init] <INFO> Shutdown complete. Exiting.
2014-05-21 16:00:15.459 INFO New log

Node 2
2014-05-21 13:42:57.962 Spread Client:0x7bac900 [Main] <INFO> Handling signal: 11
2014-05-21 13:42:58.800 Spread Client:0x7bac900 [Main] <PANIC> Received fatal signal SIGSEGV.
2014-05-21 13:42:58.801 Spread Client:0x7bac900 [Main] <PANIC> Info: si_code: 1, si_pid: -1498248300, si_uid: 32678, si_addr: 0x7fa6a6b28b94
2014-05-21 16:00:15.586 INFO New log

Node 3
2014-05-21 03:15:01.601 Main:0x65d12b0 [Init] <INFO> Starting up Vertica Analytic Database v6.1.2-0
2014-05-21 13:40:09.001 ManageEpochs:0x7f28d400f6e0 [Util] <INFO> Task 'ManageEpochs' enabled
2014-05-21 13:40:12.565 Spread Client:0x8535520 <PANIC> @v_montreal_node0003: VX001/3927: Message could not be deserialized: stream error: <whitespace snipped> LOCATION: messageReceived, /scratch_a/release/vbuild/vertica/Dist/DistCalls.cpp:9410
2014-05-21 13:40:13.002 AnalyzeRowCount:0x7f28d43fbca0 [Util] <INFO> Task 'AnalyzeRowCount' enabled
2014-05-21 13:40:13.183 Spread Client:0x8535520 [Main] <PANIC> Wrote backtrace to ErrorReport.txt
2014-05-21 16:00:15.377 INFO New log
What happened here? Did we hit some kind of bug in Vertica?

(BTW, sorry about the formatting. This editor is very difficult to work with regarding code.)
Comments
This could be a Vertica bug. It's certainly a serious issue somewhere. If you have an Enterprise account, I would advise you to file a support case so they can help you track it down.
Assuming that it's not a Vertica bug for the moment, there are a few other things it could be. It could be a hardware issue. Bad RAM will often lead to invalid memory accesses (== SIGSEGV). That's certainly the most common external source of this kind of failure. A memtest-type utility can detect if this is going on, but most of these require rebooting the machine in question.
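To make the PANIC lines from node 2 a little more concrete: si_code, si_addr and friends come from the siginfo_t structure the kernel hands to a fatal-signal handler. The sketch below is not Vertica's actual handler, just a minimal illustration of where those fields come from. An si_code of 1 for SIGSEGV is SEGV_MAPERR ("address not mapped to object"), and for a fault-generated SIGSEGV the si_pid/si_uid slots are generally not meaningful, which is why they look like garbage in your log.

// Minimal sketch (NOT Vertica's code) of how a fatal-signal handler
// obtains the fields reported in the vertica.log PANIC lines.
#include <signal.h>
#include <unistd.h>
#include <cstdio>

static void segv_handler(int sig, siginfo_t *info, void * /*ctx*/) {
    // si_code == 1 (SEGV_MAPERR): the thread touched an unmapped address.
    // fprintf is not async-signal-safe; acceptable here for illustration only.
    fprintf(stderr, "Received fatal signal %d\n", sig);
    fprintf(stderr, "Info: si_code: %d, si_addr: %p\n",
            info->si_code, info->si_addr);
    // A real server would also dump a backtrace here
    // (cf. "Wrote backtrace to ErrorReport.txt") before terminating.
    _exit(1);
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;          // request the siginfo_t argument
    sigaction(SIGSEGV, &sa, nullptr);

    volatile int *bad = nullptr;
    return *bad;                       // deliberate invalid read -> SIGSEGV
}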
It could be an issue in the network -- Vertica assumes that, if it sends a packet to itself, that packet will either get through or get dropped; we're not robust against "byzantine failures" such as data corruption that somehow gets through the hardware-level and Linux-kernel-level filters. (This can happen if, for example, you have a bad router or firewall in the middle of your private network -- all Vertica machines should be on the same switch, without routers or hardware firewalls between them.)
Since this is on the Spread thread, it could be some other application trying to do something clever with Spread's traffic. Spread is an open-source library/protocol; we occasionally see people trying to monitor or tinker with it in order to track what's going on with the cluster. Not the best idea: if you mess with Vertica's low-level internals, Vertica may not work as expected. These days you should be able to get all of that information out of Vertica through our clients and system tables, which is much safer and easier.
If it is a Vertica issue, we'd really like to know what's causing it. Our Support folks have the tools to figure that out. Short of that -- if you can find and post a specific set of operations that consistently makes the crash happen again, it'd be useful to know. As always, try updating to the latest patch; we do fix bugs from time to time.
Adam
I will open a support case if it happens again.
It looks like the crashing node sent out a corrupted, partial control message on its way down. Because there are correctness issues associated with missing a control message, the recipient that saw the corrupted partial message shut itself down so that it could re-join the cluster and run our recovery logic to catch up on whatever it missed. (Ordinary network traffic won't do this, but Spread is what keeps the cluster in a consistent state, so corrupt Spread traffic is very unusual and always bad... We have various safeguards against sending bad messages, but if a node is crashing hard due to memory corruption, there's only so much you can do...)
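For intuition about what a "Message could not be deserialized: stream error" panic means in practice, here is a hedged sketch of the general pattern: a reader that validates a length-prefixed frame and refuses to proceed when the bytes that arrived disagree with what the header declares. The format below is invented for illustration (and assumes a little-endian host); it is not Vertica's actual control-message encoding.

// Sketch of the failure mode: a flipped bit in the length field (bad RAM,
// corrupt packet) makes the declared size disagree with what arrived,
// and the receiver must treat that as fatal for a control message.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

struct Frame { uint32_t type; std::string payload; };

// Parse one frame: [u32 payload length][u32 type][payload bytes].
Frame deserialize(const std::vector<uint8_t>& buf) {
    if (buf.size() < 8)
        throw std::runtime_error("stream error: truncated header");
    uint32_t len = 0, type = 0;
    std::memcpy(&len,  buf.data(),     4);
    std::memcpy(&type, buf.data() + 4, 4);
    if (buf.size() - 8 != len)
        throw std::runtime_error("stream error: payload length mismatch");
    return Frame{type, std::string(buf.begin() + 8, buf.end())};
}

int main() {
    std::vector<uint8_t> msg = {5, 0, 0, 0,  1, 0, 0, 0,  'h', 'e', 'l', 'l', 'o'};
    std::cout << deserialize(msg).payload << "\n";   // ok: "hello"
    msg[0] = 200;                                    // corrupt the length field
    try { deserialize(msg); }
    catch (const std::exception& e) { std::cout << e.what() << "\n"; }
}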
In a larger cluster, this would probably not take down the whole cluster (even at K=1, thanks to data K-safety).
But Vertica requires that *more than half* of the cluster be visible at all times. Otherwise the cluster might become segmented: if one cluster ever split into two independently running clusters, we could never merge them back together, and the "more than half" rule prevents that even when the two halves can't communicate at all. The remaining node noticed that it could only see 1 of the 3 nodes -- less than half of the cluster -- so it shut itself down.
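The node-count arithmetic behind those two rules is simple. The sketch below is a simplified illustration only -- the real logic also checks which projections live on which nodes, which this ignores.

// Simplified sketch of the two survival rules discussed above,
// applied to the 3-node, K=1 cluster from the logs.
#include <iostream>

// Quorum: strictly more than half of the configured nodes must be visible,
// otherwise the surviving side shuts down to avoid a split-brain.
bool has_quorum(int up_nodes, int total_nodes) {
    return 2 * up_nodes > total_nodes;
}

// K-safety (node-count view): at most K node failures are tolerated.
bool k_safe(int down_nodes, int k) {
    return down_nodes <= k;
}

int main() {
    // Node 2 crashes, the corrupt-message recipient shuts down,
    // and the last node is left alone: 1 of 3 up.
    int total = 3, k = 1;
    int up = 1, down = total - up;
    std::cout << "quorum: " << has_quorum(up, total)   // 0: 1 of 3 is not a majority
              << ", k-safe: " << k_safe(down, k)       // 0: 2 failures > K=1
              << "\n";
}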
So, yeah -- you unfortunately had a whole string of things line up just the wrong way to take down the whole cluster. Very unusual. If you see it happen again, please do file a Support case.