Help - Vertica Database Shuts Down Every ~10 Days

I need help understanding why my Vertica database shuts down every ~10 days.

 

Misc Notes:

1 - There is no spread.conf file in my configuration.

2 - There is no spread.log file in my configuration.

3 - K-Safety is set to 0

4 - There is only one node in the cluster.

5 - The command used to install was:

     

/opt/vertica/sbin/install_vertica --hosts db-srv4, db-srv5 --rpm /root/software/vertica/vertica-7.2.1-0.x86_64.RHEL6.rpm --dba-user dbadmin --data-dir /u01_vertica/data

 

Here's a log of the most recent shutdown.

2016-11-20 02:22:00.391 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: Connection closed by spread
2016-11-20 02:22:00.563 TM Moveout:0x7fa040012000-b00000001fc9dc [Txn] <INFO> Begin Txn: b00000001fc9dc 'Moveout: Tuple Mover'
2016-11-20 02:22:00.564 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
2016-11-20 02:22:00.642 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> stop: disconnecting #node_b#N192168200226 from spread daemon, Mbox=13
...
2016-11-20 02:22:01.070 HeartbeatGenerator:0x7fa0400126a0 [Comms] <WARNING> Node Heartbeat: Not proceeding with heartbeat checks. node in cluster?: 0, Heartbeat Started?: 1
2016-11-20 02:22:01.075 Spread Client:0x83d7110 [Comms] <WARNING> error SP_receive: Illegal spread was provided
...
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31c9
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ca
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cb
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cc
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cd
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ce
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cf
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d0
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d2
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d3
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31fa
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> nodeSetNotifier: node v_report01_node0002 left the cluster
2016-11-20 02:22:01.092 RebalanceCluster:0x7fa040012cd0 [Util] <INFO> Task 'RebalanceCluster' enabled
2016-11-20 02:22:01.092 LGELaggingCheck:0x7fa040014220 [Util] <INFO> Task 'LGELaggingCheck' enabled
2016-11-20 02:22:01.104 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Node left cluster, reassessing k-safety...
2016-11-20 02:22:01.121 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Setting node v_report01_node0002 to UNSAFE
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.121254 ExpirationTimestamp: 2084-12-08 05:36:08.121254 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 startup state to UNSAFE DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.169587 ExpirationTimestamp: 2016-11-20 02:22:01.169587 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 leaving startup state UP DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Changing node v_report01_node0002 startup state from UP to UNSAFE
2016-11-20 02:22:01.184 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2016-11-20 02:22:01.184184 ExpirationTimestamp: 2016-11-20 02:32:01.184184 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=0 total number of nodes=1 DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: SP_stop_monitoring invoked
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> NodeHeartbeatManager: SP_stop_monitoring failed with return code -11
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: Notifying the thread waiting on health_check message before disabling HeartbeatGenerator service
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Lost membership of the DB group
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Removing #node_b#N192168200226->v_report01_node0002 from processToNode and other maps due to departure from Vertica:all
...
2016-11-20 02:22:01.229 SafetyShutdown:0x7fa04c005400 [Shutdown] <INFO> Shutting down this node

Comments

  •  Hi, 

    Can you post your ErrorReport.txt file? It is located in the catalog path of your database.

     

  • The file is empty.

     

    [root@681380-db-srv4 v_report01_node0002_catalog]# cat ErrorReport.txt
    [root@681380-db-srv4 v_report01_node0002_catalog]#
    [root@681380-db-srv4 v_report01_node0002_catalog]# pwd
    /data/report01/v_report01_node0002_catalog
    [root@681380-db-srv4 v_report01_node0002_catalog]#
  •     OK, I don't think it's a good idea to go into troubleshooting, as you are running Vertica in a non-recommended setup (single node).

    But for the fun of it:

    Can you look for ERROR entries in your vertica.log file and post them.

    Can you also post the dbLog? It is located just below your catalog path.
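    The two checks above can be sketched as a quick scan. It is demonstrated on a sample log line so the commands are copy-pastable; on the host, point LOG at the real vertica.log under the catalog path (the path below is assumed from earlier in the thread):

```shell
# Pull WARNING/ERROR lines from vertica.log before posting them.
# Demonstrated on a sample file; on the host, set LOG to the real path,
# e.g. /data/report01/v_report01_node0002_catalog/vertica.log (assumed path).
LOG=vertica.sample.log
cat > "$LOG" <<'EOF'
2016-11-20 02:22:00.391 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: Connection closed by spread
2016-11-20 02:22:00.563 TM Moveout:0x7fa040012000 [Txn] <INFO> Begin Txn
EOF
# Keep only WARNING/ERROR level entries
grep -E '<(WARNING|ERROR)>' "$LOG"
```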

  • dbLog:

    11/20/16 02:10:45 SP_scat_receive: failed receiving header on session 13 (ret: 0 len: 0): Resource temporarily unavailable
    11/20/16 02:22:00 SP_disconnect: mbox=13, pid=6161, send_group=#node_b#N192168200226

    Vertica Log:

    2016-11-20 02:22:01.075 Spread Client:0x83d7110 [Comms] <WARNING> error SP_receive: Illegal spread was provided
    ...
    2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] ...
    2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Changing node v_report01_node0002 startup state from UP to UNSAFE
  •  

    Yeah, not too much to work with :)

     

     

     What version are you on?

    Also, can you see anything in /var/log/messages that might relate to the OOM Killer killing the Vertica process?

    What is the size of your swap space on this host?

    Can you list the error messages in Vertica from before the crash?

     

  •  

    SWAP = 50 GB

     

    /var/log/messages (for Nov 20):

    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.348685] EXT4-fs (dm-14): 5 orphan inodes deleted
    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.349129] EXT4-fs (dm-14): recovery complete
    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.364178] EXT4-fs (dm-14): mounted filesystem with ordered data mode. Opts:

    The error messages in the Vertica log before the crash are listed above in the problem description.

     

    How do you know that the OOM Killer killed the process?

     

    Thanks!

     The Linux OS will overcommit memory, and when **bleep** hits the fan it will kill the largest memory consumer.

    To find the OOM log:

    grep -i 'killed process' /var/log/messages

    or

    dmesg | egrep -i 'killed process'

     

     

    What you can do next is log your system resources:

    vmstat -SM 60 10000 > memoryusage.out &

     

     where:

    Memory
    swpd: the amount of virtual memory used.
    free: the amount of idle memory.
    buff: the amount of memory used as buffers.
    cache: the amount of memory used as cache.
    inact: the amount of inactive memory. (-a option)
    active: the amount of active memory. (-a option)

    Swap
    si: Amount of memory swapped in from disk (/s).
    so: Amount of memory swapped to disk (/s).
     

    This is just to rule out the possibility that this is a resource-caused outage.
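    A quick way to review the capture afterward is to flag any sample with swap activity: in the default vmstat layout, si (swap-in) is column 7 and so (swap-out) is column 8. Sketched on a sample capture; on the host, run it against memoryusage.out:

```shell
# Flag vmstat samples that show swap activity (si = column 7, so = column 8
# in the default layout). Demonstrated on a sample capture; on the host,
# point it at memoryusage.out instead.
cat > memoryusage.sample <<'EOF'
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  34592  55064    107 109680    0    0  1117    70    0    0  5  1 92  2  0
 1  0  34600  54012    107 109700   12    8   900    65  210  340  6  1 91  2  0
EOF
# Skip the two header lines, print any row where si or so is nonzero
awk 'NR > 2 && ($7 > 0 || $8 > 0)' memoryusage.sample
```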

     

     

  • Here is the output:

     

    [root@681380-db-srv4 log]# grep -i 'killed process' /var/log/messages
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# zgrep -i 'killed process' /var/log/messages*.gz
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# dmesg | egrep -i 'killed process'
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# vmstat -SM 60 10000 > memoryusage.out &
    [1] 46326
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# cat memoryusage.out
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    0 0 34592 55064 107 109680 0 0 1117 70 0 0 5 1 92 2 0

    I highly doubt it's a resource issue. It's a HUGE machine - 512 GB of memory, 50 GB of swap.

    I have a feeling it's a networking or configuration issue.

     

    Thanks again !
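    One way to chase the networking theory is to scan the system log for link flaps or interface errors near the 02:22 shutdown. The sample line below is fabricated purely for illustration; on the host, grep /var/log/messages and zgrep the rotated *.gz copies:

```shell
# Look for link up/down events or NIC errors around the shutdown time.
# The sample log line is fabricated for illustration only; on the host,
# set MSGS=/var/log/messages (and check rotated copies with zgrep).
MSGS=messages.sample
cat > "$MSGS" <<'EOF'
Nov 20 02:21:58 681380-db-srv4 kernel: eth0: link down
Nov 20 00:02:07 681380-db-srv4 kernel: EXT4-fs (dm-14): recovery complete
EOF
# Match link state changes or interface error/down/up messages
grep -Ei 'link (down|up)|eth[0-9].*(error|down|up)' "$MSGS"
```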

  •  

     I have exhausted my troubleshooting tools :)

     

    Let me ask you: why would you run a single-node DB?

    And why on such a big machine?

  • Hi,

     

    You say you are running Vertica on a single node, but in the installation command you shared, you passed two hosts as a parameter:

     

    --hosts db-srv4, db-srv5

     

    This might be the issue.

     

    Regards,

    Lucas L.-

  • Yes, I recently removed the db-srv5 node from the cluster. But the restarts had been happening before I removed that node.
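    One thing worth verifying after the node removal is that the cluster definition no longer lists db-srv5. The host list lives in admintools.conf (stock path /opt/vertica/config/admintools.conf; the exact layout can vary by version). Sketched here on a sample fragment:

```shell
# Check which hosts the cluster definition still references.
# Sample fragment for illustration; on the host, grep
# /opt/vertica/config/admintools.conf (stock install path; layout may vary).
CONF=admintools.conf.sample
cat > "$CONF" <<'EOF'
[Cluster]
hosts = 192.168.200.226
EOF
# Print the [Cluster] section header and its host list
grep -A1 '^\[Cluster\]' "$CONF"
```

If db-srv5 (or its IP) still appears there, the cluster definition and the actual topology disagree, which would fit the configuration-issue theory.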
