Node goes DOWN immediately after network error

Joseph · June 2020

Hi,
On our 6-node cluster (9.3.1), a node goes down every time there is a network error.

vertica.log

    2020-04-29 04:42:06.340 Spread Mailbox Dequeue:0x7fbf777fe700 [Comms] <WARNING> error SP_receive: Connection closed by spread
    2020-04-29 04:42:06.375 Spread Mailbox Dequeue:0x7fbf777fe700 [Comms] <WARNING> error SP_receive: The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
    2020-04-29 04:42:06.376 Spread Mailbox Dequeue:0x7fbf777fe700 [Comms] <INFO> stop: disconnecting #node_a#N192168000001 from spread daemon, Mbox=6
    2020-04-29 04:42:06.380 Spread Mailbox Dequeue:0x7fbf777fe700 [Comms] <INFO> connected: false

spread.log

    [Wed 29 Apr 2020 04:40:16] Sess_read: received a heartbeat on 'node_a' ( mailbox 10 )
    [Wed 29 Apr 2020 04:40:16] Pushed eviction timeout back 600.000000s
    [Wed 29 Apr 2020 04:40:16] Send_join: State is 4
    [Wed 29 Apr 2020 04:40:17] Send_join: State is 4
    [Wed 29 Apr 2020 04:40:18] Send_join: State is 4
    [Wed 29 Apr 2020 04:40:19] Send_join: State is 4
    [Wed 29 Apr 2020 04:40:20] Send_join: State is 4
    [Wed 29 Apr 2020 04:40:21] Send_join: State is 4
    [Wed 29 Apr 2020 04:41:22] Send_join: State is 4
    [Wed 29 Apr 2020 04:41:23] Send_join: State is 4
    [Wed 29 Apr 2020 04:41:24] Send_join: State is 4
    [Wed 29 Apr 2020 04:41:25] Send_join: State is 4
    [Wed 29 Apr 2020 04:41:26] Send_join: State is 4
    [Wed 29 Apr 2020 04:41:27] Send_join: State is 4
    [Wed 29 Apr 2020 04:41:28] Send_join: State is 4
    [Wed 29 Apr 2020 04:42:06] Prot_handle_token: illegally lowering Aru from Token: Aru 138599728 -> 138599717; (Last_token->aru = 138599728)
    Exit caused by Alarm!

I checked network, memory, and CPU usage, and everything looks fine.

LenoyJ · June 2020

[Wed 29 Apr 2020 04:41:28] Send_join: State is 4

[Wed 29 Apr 2020 04:42:06] Prot_handle_token:...

The gap between these two lines (04:41:28 to 04:42:06) indicates that spread was stalled for about 38 seconds. By default, spread's heartbeat timeout is 8 seconds for clusters with one spread segment. Is spread logging turned on? It is disabled by default and should stay off, so that spread spends its time passing communication messages rather than logging. See this best practices guide for more information.
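If you want to check the rest of spread.log for similar stalls, here is a minimal Python sketch, not anything that ships with Vertica. It assumes every line starts with a bracketed timestamp in the same format as the excerpt above. The function name `find_stalls` and the 8-second default threshold (taken from the heartbeat timeout mentioned above) are my own choices for illustration.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp seen in the spread.log excerpt,
# e.g. "[Wed 29 Apr 2020 04:41:28]". Assumes an English/C locale for day and month names.
STAMP = re.compile(r"\[(\w{3} \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2})\]")

def find_stalls(lines, threshold_s=8.0):
    """Return (previous, current, gap_seconds) for each pair of consecutive
    timestamped lines that are more than threshold_s seconds apart."""
    stalls = []
    prev = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # skip lines without a timestamp, e.g. "Exit caused by Alarm!"
        ts = datetime.strptime(match.group(1), "%a %d %b %Y %H:%M:%S")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                stalls.append((prev, ts, gap))
        prev = ts
    return stalls

# Three lines copied from the spread.log posted above.
sample = [
    "[Wed 29 Apr 2020 04:41:27] Send_join: State is 4",
    "[Wed 29 Apr 2020 04:41:28] Send_join: State is 4",
    "[Wed 29 Apr 2020 04:42:06] Prot_handle_token: illegally lowering Aru from Token: "
    "Aru 138599728 -> 138599717; (Last_token->aru = 138599728)",
]

for prev, cur, gap in find_stalls(sample):
    print(f"{gap:.0f}s gap between {prev:%H:%M:%S} and {cur:%H:%M:%S}")
# prints "38s gap between 04:41:28 and 04:42:06"
```

To scan a whole file, pass an open file handle instead of the list, for example `find_stalls(open("spread.log"))` with whatever path your spread.log actually lives at.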

If you're in a virtualized environment (like Azure), you should probably also adjust the spread timeout. Some cloud environments are known to have network freezes for hypervisor updates and the like. See this doc for how to do it: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/UsingVerticaOnAzure/AdjustingSpreadDaemonTimeouts.htm
