spread client Saw membership message 8192 on V: - duplicate
Hi,
In our production vertica cluster (3 nodes), one node has been crashing (not the same node) for 3 times this week.
I didn't find no error in /var/log*.log files.
Our cluster is installed on AWS. It's version is : v7.1.1.
-----------------
2016-10-22 13:22:11.003 LGELaggingCheck:0x7f508c013b30-c0000000e27980 [Txn] <INFO> Begin Txn: c0000000e27980 'Recovery: Get last good epoch'
2016-10-22 13:22:11.007 LGELaggingCheck:0x7f508c013b30-c0000000e27980 [Txn] <INFO> Starting Commit: Txn: c0000000e27980 'Recovery: Get last good epoch'
2016-10-22 13:22:11.007 LGELaggingCheck:0x7f508c013b30 [Txn] <INFO> Commit Complete: Txn: c0000000e27980 at epoch 0x2590c6
2016-10-22 13:22:11.008 LGELaggingCheck:0x7f508c013b30-c0000000e27981 [Txn] <INFO> Begin Txn: c0000000e27981 'Check LGE'
2016-10-22 13:22:11.008 LGELaggingCheck:0x7f508c013b30-c0000000e27981 [Txn] <INFO> Rollback Txn: c0000000e27981 'Check LGE'
2016-10-22 13:22:11.008 LGELaggingCheck:0x7f508c013b30 [Util] <INFO> Task 'LGELaggingCheck' enabled
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> Saw membership message 8192 on V:xxxxx
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> Saw membership message 6144 on V:officedw
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> NETWORK change with 1 VS sets
2016-10-22 13:22:27.450 Spread Client:0x8d5a1a0 [Comms] <INFO> DB Group changed
------------------------------------
2016-10-21 03:12:47.079 Init Session:0x7f35a4015860 <FATAL> @v_xxxx_node0003:
{SessionRun} 57V03/5785: Cluster Status Request by 10.165.30.20:54401
HINT: Cluster State: xxxx
UP: 3 of 3 (v_xxxx_node0001, v_xxxx_node0002, v_xxxx_node0003)
----
LOCATION: initSession, /scratch_a/release/30493/vbuild/vertica/Session/ClientSession.cpp:436
2016-10-21 03:13:02.585 Init Session:0x7f35a4012c40 <LOG> @v_xxxx_node0003: 00000/2705: Connection received: host=10.165.30.20 port=54405 (connCnt 3)
2016-10-21 03:13:02.585 Init Session:0x7f35a4012c40 <LOG> @v_xxxx_node0003: 00000/4540: Received SSL negotiation startup packet
2016-10-21 03:13:02.585 Init Session:0x7f35a4012c40 <LOG> @v_xxxx_node0003: 00000/4691: Sending SSL negotiation response 'N'
2016-10-21 03:13:02.586 Init Session:0x7f35a4012c40 <FATAL> @v_xxxx_node0003: {SessionRun}
57V03/5785: Cluster Status Request by 10.165.30.20:54405
HINT: Cluster State: xxxx
UP: 3 of 3 (v_xxxx_node0001, v_xxxx_node0002, v_xxxx_node0003)
LOCATION: initSession, /scratch_a/release/30493/vbuild/vertica/Session/ClientSession.cpp:436
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> Saw membership message 8192 on V:xxxx
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> Saw membership message 6144 on V:xxxx
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> NETWORK change with 1 VS sets
2016-10-21 03:13:16.260 Spread Client:0x8f5a940 [Comms] <INFO> DB Group changed
---------------------------
2016-10-18 15:23:21.170 Init Session:0x7f46ec019690-b000000196f6a4 [Txn] <INFO> Starting Commit: Txn: b000000196f6a4 ' select count(node_name) as 'Low' FROM (select node_name, round(used_bytes_end_value*100.00/(used_bytes_end_value+free_bytes_end_value), 2.0) as val_used from v_internal.dc_storage_info_by_minute where time > sysdate() - interval '1 min' and filesystem = 'vertica' order by node_name) as disk_use WHERE disk_use.val_used > (select current_value from configuration_parameters where parameter_name = 'LowDiskSpaceWarningPct')'
2016-10-18 15:23:21.172 Init Session:0x7f46ec019690 [Txn] <INFO> Commit Complete: Txn: b000000196f6a4 at epoch 0x24d6df
2016-10-18 15:23:21.175 Init Session:0x7f46ec019690 [Session] <INFO> [PQuery] TX:0(verticanode2-8536:0x93a86) select CONCAT (cast(count(node_name) as varchar(12)), ' Nodes') as 'Total' from v_internal.dc_storage_info_by_minute where time > sysdate() - interval '1 min' and filesystem = 'vertica'
2016-10-18 15:23:21.176 Init Session:0x7f46ec019690 [Session] <INFO> [Query] TX:0(verticanode2-8536:0x93a86) select CONCAT (cast(count(node_name) as varchar(12)), ' Nodes') as 'Total' from v_internal.dc_storage_info_by_minute where time > sysdate() - interval '1 min' and filesystem = 'vertica'
2016-10-18 15:23:21.180 Init Session:0x7f46ec019690-b000000196f6b9 [Txn] <INFO> Begin Txn: b000000196f6b9 ' select CONCAT (cast(count(node_name) as varchar(12)), ' Nodes') as 'Total' from v_internal.dc_storage_info_by_minute where time > sysdate() - interval '1 min' and filesystem = 'vertica''
2016-10-18 15:23:21.465 Init Session:0x7f46ec013ac0-b000000196f6b7 [EE] <INFO> SortManager found maxMerges 7 too small(0 MB Assigned).
2016-10-18 15:23:21.465 Init Session:0x7f46ec013ac0-b000000196f6b7 [EE] <INFO> After disabling optimization, maxMerges becomes 7.
2016-10-18 15:23:29.001 ManageEpochs:0x7f46d0014d90 [Util] <INFO> Task 'ManageEpochs' enabled
2016-10-18 15:23:29.002 AnalyzeRowCount:0x7f46d0013be0 [Util] <INFO> Task 'AnalyzeRowCount' enabled
2016-10-18 15:23:38.238 Spread Client:0x9694f80 [Comms] <INFO> Saw membership message 8192 on V:xxxx
2016-10-18 15:23:38.238 Spread Client:0x9694f80 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-18 15:23:38.238 Spread Client:0x9694f80 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2016-10-18 15:23:38.238 Spread Client:0x9694f80 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-18 15:23:38.238 Spread Client:0x9694f80 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2016-10-18 15:23:38.238 Spread Client:0x9694f80 [Comms] <INFO> Saw transitional message; watch for lost daemons
2016-10-18 15:23:38.253 Spread Client:0x9694f80 [Comms] <INFO> Saw membership message 6144 on xxxx
2016-10-18 15:23:38.253 Spread Client:0x9694f80 [Comms] <INFO> NETWORK change with 1 VS sets
2016-10-18 15:23:38.253 Spread Client:0x9694f80 [Comms] <INFO> DB Group changed
Has someone any idea?
Thank you,
Veronica
Comments
This is the message of interest :
<INFO> NETWORK change with 1 VS sets
that means that the the network group changed and probably because it did not receive the membership message from one of the nodes. The heartbeat message should be sent in intervals of 8 seconds and for one node they do not read it the other nodes remove the node from the cluster and the node when see that is not part of the cluster it shutdown. So you should look at your UDP traffic, there is a system table that shows details dc_spread_monitor.
Hope this helps.
Eugenia
Also, please notice this document
https://community.dev.hpe.com/t5/Vertica-Knowledge-Base/Spread-Debugging/ta-p/237864
that talks about debbuging spread, it will provide you more info to find out if you have an spread issue.
Eugenia