verticad keeps crashing

I'm working on a proof of concept using the Vertica community edition (7.0.1-0 x86_64) and every few days, the verticad process dies on one of the nodes. 

I haven't even loaded data yet and all I've found in the vertica.log file are messages about the node leaving the cluster. What else can I check? I'd like to know why this process is failing.

2014-08-26 09:39:01.013 LowDiskSpaceCheck:0x7f9204010ae0-c00000000e6b7b [Txn] <INFO> Rollback Txn: c00000000e6b7b 'LowDiskSpaceCheck'
2014-08-26 09:39:01.015 LowDiskSpaceCheck:0x7f9204010ae0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-08-26 09:39:05.049 Spread Client:0x6413b50 [Comms] <INFO> Saw membership message 5120 on V:poc_db
2014-08-26 09:39:05.049 Spread Client:0x6413b50 [Comms] <INFO> DB Group changed
2014-08-26 09:39:05.049 Spread Client:0x6413b50 [VMPI] <INFO> DistCall: Set current group members called with 2 members
2014-08-26 09:39:05.050 Spread Client:0x6413b50 [Comms] <INFO> nodeSetNotifier: node v_poc_db_node0016 left the cluster
2014-08-26 09:39:05.050 Spread Client:0x6413b50 [Recover] <INFO> Node left cluster, reassessing k-safety...
2014-08-26 09:39:05.050 Spread Client:0x6413b50 [Recover] <INFO> Checking Deps:Down bits: 001 Deps:
011 - cnt: 1
101 - cnt: 1
110 - cnt: 1
111 - cnt: 2
2014-08-26 09:39:05.050 Spread Client:0x6413b50 <LOG> @v_poc_db_node0018: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2014-08-26 09:39:05.050351 ExpirationTimestamp: 2082-09-13 11:53:12.050351 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_poc_db_node0018 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: poc_db Hostname: dvcrinfa136
2014-08-26 09:39:05.052 Spread Client:0x6413b50 [Comms] <INFO> Saw membership message 5120 on Vertica:all
2014-08-26 09:39:05.052 Spread Client:0x6413b50 [Comms] <INFO> Removing #node_a#N010017106021->v_poc_db_node0016 from processToNode and other maps due to departure from Vertica:all
2014-08-26 09:39:05.052 Spread Client:0x6413b50 [Comms] <INFO> nodeToState map:
2014-08-26 09:39:05.052 Spread Client:0x6413b50 [Comms] <INFO>   v_poc_db_node0017 : UP
2014-08-26 09:39:05.052 Spread Client:0x6413b50 [Comms] <INFO>   v_poc_db_node0018 : UP


Comments

  • Hi Ryan,

    At first glance, it looks to me like you have a very flaky or overloaded network.  Did your cluster pass the network tests during installation?

    You're specifically seeing nodes drop out of and re-join the Spread cluster.  Spread is what Vertica uses for distributed synchronization.  From node A's perspective, node B will appear to have dropped out of the cluster if node B still hasn't responded after many ("many thousands", actually) attempts to get in touch with it.
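
    To see when and how often nodes drop out, the "left the cluster" lines in vertica.log can be pulled out and counted.  Below is a minimal Python sketch that assumes the log line format shown in the question; the function name find_departures is made up for illustration and is not part of any Vertica tooling.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in the question:
# 2014-08-26 09:39:05.050 Spread Client:0x6413b50 [Comms] <INFO> nodeSetNotifier: node v_poc_db_node0016 left the cluster
LEFT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"nodeSetNotifier: node (?P<node>\S+) left the cluster"
)

def find_departures(lines):
    """Return (timestamp, node_name) for every 'left the cluster' line."""
    events = []
    for line in lines:
        match = LEFT_RE.search(line)
        if match:
            events.append((match.group("ts"), match.group("node")))
    return events

# Two lines copied from the excerpt above; only the second is a departure.
sample = [
    "2014-08-26 09:39:05.049 Spread Client:0x6413b50 [Comms] <INFO> DB Group changed",
    "2014-08-26 09:39:05.050 Spread Client:0x6413b50 [Comms] <INFO> "
    "nodeSetNotifier: node v_poc_db_node0016 left the cluster",
]
events = find_departures(sample)

# Group departures by hour (first 13 characters of the timestamp) to spot
# patterns such as failures that only happen off hours.
per_hour = Counter(ts[:13] for ts, _ in events)
```

    To run it against a real log, pass an open file instead of the sample list, e.g. find_departures(open(path_to_vertica_log)), where the path depends on your installation.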

    "verticad" doesn't exactly crash in this case -- what happens is, if node A can no longer see either node B or C (in a three-node cluster) because of network trouble, then node A realizes that it is unable to present a consistent view of your data because it has lost touch with so much of the cluster that it won't receive updates, can't process distributed COMMITs, etc.  In this case, it saves its state and shuts down.  It can safely be restarted after the network has been fixed.
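
    The shutdown decision described above can be pictured as a majority check.  This is only a simplified sketch, not Vertica's actual implementation: the function name is hypothetical, and the real logic also looks at K-safety data dependencies (the "Deps" lines in your log).

```python
def should_stay_up(visible_nodes, total_nodes):
    """Simplified majority rule (an assumption, not Vertica's real code).

    visible_nodes counts this node plus every peer it can still reach.
    The node keeps running only while it can see more than half of the
    cluster; otherwise it cannot present a consistent view of the data.
    """
    return 2 * visible_nodes > total_nodes

# Three-node cluster, one peer lost: two of three are still visible.
two_of_three = should_stay_up(2, 3)

# Three-node cluster, both peers lost: the node sees only itself.
one_of_three = should_stay_up(1, 3)
```

    In your log, node0016 left and two of three nodes remained UP, which matches the first case; the Critical event warns that losing one more node would lead to the second.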

    Note that Vertica makes heavy use of UDP, especially UDP Broadcast, for its Spread layer.  This is efficient for us, but it is somewhat uncommon, so it can cause problems depending on how your network devices are configured.  There are a bunch of posts about Spread elsewhere on this forum; feel free to look around, or ask here if you can't figure out how to solve your problem.

    Adam



  • Hi Ryan!
    "it looks to me like you have a very flaky or overloaded network"
    Can you check for STALE CHECKPOINT?
    What is the LGE (Last Good Epoch) on each node? Can you try to advance the AHM (Ancient History Mark) with select make_ahm_now()?
     

  • Thanks for the great info. I ran the vnetperf tool again and didn't see any issues. All of the node failures have been off hours, so it's possible that the network is unstable. 

    I ran make_ahm_now() and the result is:

     AHM set (New AHM Epoch: 2)

    Not sure how to interpret that but I'll find it in the administrator documentation. 

    Thanks!

