Help - Vertica Database Shuts Down Every ~10 Days
I need help understanding why my Vertica database shuts down every ~10 days.
Misc Notes:
1 - There is no spread.conf file in my configuration.
2 - There is no spread.log file in my configuration.
3 - K-Safety is set to 0
4 - There is only one node in the cluster.
5 - The command used to install was:
/opt/vertica/sbin/install_vertica --hosts db-srv4,db-srv5 --rpm /root/software/vertica/vertica-7.2.1-0.x86_64.RHEL6.rpm --dba-user dbadmin --data-dir /u01_vertica/data
Here's a log of the most recent shutdown:
2016-11-20 02:22:00.391 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: Connection closed by spread
2016-11-20 02:22:00.563 TM Moveout:0x7fa040012000-b00000001fc9dc [Txn] <INFO> Begin Txn: b00000001fc9dc 'Moveout: Tuple Mover'
2016-11-20 02:22:00.564 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
2016-11-20 02:22:00.642 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> stop: disconnecting #node_b#N192168200226 from spread daemon, Mbox=13
...
2016-11-20 02:22:01.070 HeartbeatGenerator:0x7fa0400126a0 [Comms] <WARNING> Node Heartbeat: Not proceeding with heartbeat checks. node in cluster?: 0, Heartbeat Started?: 1
2016-11-20 02:22:01.075 Spread Client:0x83d7110 [Comms] <WARNING> error SP_receive: Illegal spread was provided
...
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31c9
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ca
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cb
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cc
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cd
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ce
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cf
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d0
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d2
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d3
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31fa
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> nodeSetNotifier: node v_report01_node0002 left the cluster
2016-11-20 02:22:01.092 RebalanceCluster:0x7fa040012cd0 [Util] <INFO> Task 'RebalanceCluster' enabled
2016-11-20 02:22:01.092 LGELaggingCheck:0x7fa040014220 [Util] <INFO> Task 'LGELaggingCheck' enabled
2016-11-20 02:22:01.104 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Node left cluster, reassessing k-safety...
2016-11-20 02:22:01.121 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Setting node v_report01_node0002 to UNSAFE
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.121254 ExpirationTimestamp: 2084-12-08 05:36:08.121254 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 startup state to UNSAFE DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.169587 ExpirationTimestamp: 2016-11-20 02:22:01.169587 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 leaving startup state UP DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Changing node v_report01_node0002 startup state from UP to UNSAFE
2016-11-20 02:22:01.184 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2016-11-20 02:22:01.184184 ExpirationTimestamp: 2016-11-20 02:32:01.184184 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=0 total number of nodes=1 DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: SP_stop_monitoring invoked
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> NodeHeartbeatManager: SP_stop_monitoring failed with return code -11
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: Notifying the thread waiting on health_check message before disabling HeartbeatGenerator service
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Lost membership of the DB group
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Removing #node_b#N192168200226->v_report01_node0002 from processToNode and other maps due to departure from Vertica:all
...
2016-11-20 02:22:01.229 SafetyShutdown:0x7fa04c005400 [Shutdown] <INFO> Shutting down this node
Comments
Hi,
Can you post your ErrorReport.txt file? It is located in the catalog path of your database.
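If you are not sure where that is, something like this should locate it (using the data dir from your install command):
find /u01_vertica/data -name ErrorReport.txt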
The file is empty.
OK, I don't think it's worthwhile to go deep into troubleshooting, as you are running Vertica in a not recommended setup (single node),
but for the fun of it:
Can you look for ERROR entries in your vertica.log file and post them?
Can you also post the dbLog? It is located just below your catalog path.
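For example, something like this, run from your catalog directory (the severity tag in vertica.log is wrapped in angle brackets, as in the lines you posted):
grep '<ERROR>' vertica.log | tail -n 50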
dbLog:
Vertica log:
Yeah, not too much to work with.
What version are you on?
Also, can you see anything in /var/log/messages that might relate to the OOM killer acting on the Vertica process?
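For example:
grep -iE 'oom|out of memory|killed process' /var/log/messages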
What is the size of your swap space on this host?
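For example, free -g shows memory and swap totals in GB, and swapon -s lists the swap devices:
free -g
swapon -s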
Can you list the error_messages in Vertica from before the crash?
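Something like this should list them (I'm assuming the v_monitor.error_messages system table here; column names can vary a bit by version):
vsql -c "SELECT event_timestamp, node_name, error_level, message FROM error_messages ORDER BY event_timestamp DESC LIMIT 50;"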
SWAP = 50 GB
/var/log/messages (for Nov 20):
The error messages in the Vertica log before the crash are listed above in the problem description.
How do you know that the OOM Killer killed the process?
Thanks!
The Linux OS will overcommit memory, and when things hit the fan it will kill the largest memory consumer.
Find the OOM entries in the log (the grep above will show them), or, as a next step, log your system resources over time, e.g. with vmstat (see the example after the field list below), where:
Memory
swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)
Swap
si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).
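For example, a rough way to keep a running log with vmstat (one sample every 60 seconds; /tmp/vmstat.log is just an example path):
nohup vmstat -a 60 >> /tmp/vmstat.log 2>&1 &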
This is just to rule out the possibility that this is a resource-caused outage.
Here is the output:
I highly doubt it's a resource issue. It's a HUGE machine: 512 GB of memory, 50 GB of swap.
I have a feeling it's a networking or configuration issue.
Thanks again!
I have exhausted my troubleshooting tools!
Let me ask you: why would you run a single-node DB?
And why on such a big machine?
Hi,
You are saying that you are running Vertica on a single node, but in the installation command you shared, you passed two hosts as parameters.
This might be the issue.
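You can double-check which hosts Vertica still considers cluster members with admintools, e.g.:
/opt/vertica/bin/admintools -t list_allnodes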
Regards,
Lucas L.-
Yes, I recently removed the db-srv5 node from the cluster, but the restarts had been happening before I removed that node.