Cluster shutting down when refreshing projection

I recently added a new node to our DB (moving from a single node to a 2-node cluster). Of our 3 projections, 1 was correctly distributed over both nodes, but the Admin Tools rebalance operation changed the other 2 to SEGMENTED BY HASH(...) NODES <first node> instead of ALL NODES, so those projections did not get redistributed. (I have no idea why that would happen.)
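
(For reference, a query along these lines should show how each projection is segmented and whether it is up to date - the column names are from memory and 'my_table' is a placeholder, so treat it as a sketch:)

SELECT projection_name, node_name, is_segmented, is_up_to_date
FROM v_catalog.projections
WHERE anchor_table_name = 'my_table';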


Attempting to fix this while the DB is in use, I re-created the projection with ALL NODES specified and triggered START_REFRESH() (is there a better way?). The refresh runs for about 8 hours without apparently doing much, then starts reading heavily from disk (I don't see many writes), and roughly 10 hours after starting, the cluster shuts down. I have repeated this 3 times and it always aborts the refresh and shuts down after the same amount of time.
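
(While it runs, something like this should show the state of the refresh in the projection_refreshes system table - again, column names are from memory, so treat it as a sketch. As far as I know, SELECT REFRESH('schema.table') is the foreground alternative to START_REFRESH(), if that counts as the "better way".)

SELECT node_name, projection_name, refresh_status, refresh_method, refresh_start
FROM v_monitor.projection_refreshes;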


Logs from node #1

2015-10-19 10:24:45.502 Spread Client:0x9766050 [Comms] <INFO> Saw membership message 8192 on V:analytics
2015-10-19 10:24:47.875 Spread Client:0x9766050 [Comms] <INFO> Saw transitional message; watch for lost daemons
2015-10-19 10:24:51.010 EEThread:0x7f5ef76d2350-a0000002abb3e5 [EE] <WARNING> Recv: Message receipt from v_analytics_node0002 failed [canceled] handle=MultiplexedRecvHandle (0x7f6264001bc0) (10.0.4.52:5434)tag 1001 cancelId 0xa0000002ab$
2015-10-19 10:24:51.910 Spread Client:0x9766050 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2015-10-19 10:24:51.910 Spread Client:0x9766050 [Comms] <INFO> Saw transitional message; watch for lost daemons
2015-10-19 10:24:51.911 Spread Client:0x9766050 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2015-10-19 10:24:51.911 Spread Client:0x9766050 [Comms] <INFO> Saw transitional message; watch for lost daemons
2015-10-19 10:24:51.911 Spread Client:0x9766050 [Comms] <INFO> Saw membership message 6144 on V:analytics
2015-10-19 10:24:51.911 Spread Client:0x9766050 [Comms] <INFO> NETWORK change with 1 VS sets
2015-10-19 10:24:51.911 Spread Client:0x9766050 [Comms] <INFO> VS set #0 (mine) has 1 members (offset=36)
2015-10-19 10:24:52.443 Spread Client:0x9766050 [Comms] <INFO> VS set #0, member 0: #node_a#N010000004011
2015-10-19 10:24:54.161 Spread Client:0x9766050 [Comms] <INFO> DB Group changed
2015-10-19 10:25:25.994 Spread Client:0x9766050 [Comms] <INFO> nodeSetNotifier: node v_analytics_node0002 left the cluster
2015-10-19 10:25:26.649 Spread Client:0x9766050 [Recover] <INFO> Node left cluster, reassessing k-safety...
2015-10-19 10:25:36.996 Spread Client:0x9766050 [Recover] <INFO> Cluster partitioned: 2 total nodes, 1 up nodes, 1 down nodes
2015-10-19 10:25:36.996 Spread Client:0x9766050 [Recover] <INFO> Setting node v_analytics_node0001 to UNSAFE


Logs from node #2

2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> Saw membership message 8192 on V:analytics
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> Saw membership message 6144 on V:analytics
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> NETWORK change with 1 VS sets
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> VS set #0 (mine) has 1 members (offset=36)
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> VS set #0, member 0: #node_b#N010000004052
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> DB Group changed
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Comms] <INFO> nodeSetNotifier: node v_analytics_node0001 left the cluster
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Recover] <INFO> Cluster partitioned: 2 total nodes, 1 up nodes, 1 down nodes
2015-10-19 10:24:01.088 Spread Client:0x80a15d0 [Recover] <INFO> Setting node v_analytics_node0002 to UNSAFE

I have no idea where to go from here.

Comments

  • Hi,

    Take a look at /var/log/messages on the host that went down and run this:

    cat /var/log/messages | grep "Out of memory"

    If a match lines up with the time of your node shutdown, it means the host ran out of memory and the OOM killer sacrificed the hungriest process.
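
    (If /var/log/messages has already rotated, the kernel ring buffer is another place to look; assuming a dmesg new enough to support -T for human-readable timestamps, something along these lines:)

    dmesg -T | grep -i "out of memory"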

  • Nothing strange to see in syslog, certainly no memory messages.

    Yesterday I dropped the projection completely and recreated it (rather than trying to refresh it with data already written to the nodes), and that seems to have done it. I guess there was some previously written data or state information that it didn't like - though I thought an interrupted refresh was aborted completely, yet a lot of data was left behind each time. Roughly, the sequence that worked is sketched below.
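
    (The schema, table and column names here are placeholders and the real projection definition is longer, so this is only a sketch of the shape of it:)

    -- remove the half-refreshed projection and the data it left behind
    DROP PROJECTION analytics.events_p2;

    -- recreate it with ALL NODES so it is distributed across the cluster
    CREATE PROJECTION analytics.events_p2 AS
        SELECT event_id, event_time, payload
        FROM analytics.events
        ORDER BY event_time
        SEGMENTED BY HASH(event_id) ALL NODES;

    -- repopulate the new projection in the background
    SELECT START_REFRESH();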
