Help - Vertica Database Shuts Down Every ~10 Days

I need help understanding why my Vertica database shuts down every ~10 days.

 

Misc Notes:

1 - There is no spread.conf file in my configuration.

2 - There is no spread.log file in my configuration.

3 - K-Safety is set to 0

4 - There is only one node in the cluster.

5 - The command used to install was:

     

/opt/vertica/sbin/install_vertica --hosts db-srv4, db-srv5 --rpm /root/software/vertica/vertica-7.2.1-0.x86_64.RHEL6.rpm --dba-user dbadmin --data-dir /u01_vertica/data

 

Here's a log of the most recent shutdown.

2016-11-20 02:22:00.391 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: Connection closed by spread
2016-11-20 02:22:00.563 TM Moveout:0x7fa040012000-b00000001fc9dc [Txn] <INFO> Begin Txn: b00000001fc9dc 'Moveout: Tuple Mover'
2016-11-20 02:22:00.564 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
2016-11-20 02:22:00.642 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> stop: disconnecting #node_b#N192168200226 from spread daemon, Mbox=13
...
2016-11-20 02:22:01.070 HeartbeatGenerator:0x7fa0400126a0 [Comms] <WARNING> Node Heartbeat: Not proceeding with heartbeat checks. node in cluster?: 0, Heartbeat Started?: 1
2016-11-20 02:22:01.075 Spread Client:0x83d7110 [Comms] <WARNING> error SP_receive: Illegal spread was provided
...
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31c9
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ca
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cb
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cc
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cd
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ce
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cf
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d0
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d2
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d3
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31fa
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> nodeSetNotifier: node v_report01_node0002 left the cluster
2016-11-20 02:22:01.092 RebalanceCluster:0x7fa040012cd0 [Util] <INFO> Task 'RebalanceCluster' enabled
2016-11-20 02:22:01.092 LGELaggingCheck:0x7fa040014220 [Util] <INFO> Task 'LGELaggingCheck' enabled
2016-11-20 02:22:01.104 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Node left cluster, reassessing k-safety...
2016-11-20 02:22:01.121 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Setting node v_report01_node0002 to UNSAFE
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.121254 ExpirationTimestamp: 2084-12-08 05:36:08.121254 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 startup state to UNSAFE DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.169587 ExpirationTimestamp: 2016-11-20 02:22:01.169587 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 leaving startup state UP DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Changing node v_report01_node0002 startup state from UP to UNSAFE
2016-11-20 02:22:01.184 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2016-11-20 02:22:01.184184 ExpirationTimestamp: 2016-11-20 02:32:01.184184 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=0 total number of nodes=1 DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: SP_stop_monitoring invoked
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> NodeHeartbeatManager: SP_stop_monitoring failed with return code -11
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: Notifying the thread waiting on health_check message before disabling HeartbeatGenerator service
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Lost membership of the DB group
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Removing #node_b#N192168200226->v_report01_node0002 from processToNode and other maps due to departure from Vertica:all
...
2016-11-20 02:22:01.229 SafetyShutdown:0x7fa04c005400 [Shutdown] <INFO> Shutting down this node

Comments

  •  Hi, 

    Can you post your ErrorReport.txt file? It is located in the catalog path of your database.

     

  • The file is empty.

     

    [root@681380-db-srv4 v_report01_node0002_catalog]# cat ErrorReport.txt
    [root@681380-db-srv4 v_report01_node0002_catalog]#
    [root@681380-db-srv4 v_report01_node0002_catalog]# pwd
    /data/report01/v_report01_node0002_catalog
    [root@681380-db-srv4 v_report01_node0002_catalog]#
  •     OK, I don't think it's a good idea to go into troubleshooting, as you are running Vertica in a non-recommended setup (single node).

    But for the fun of it:

    Can you look for ERROR entries in your vertica.log file and post them.

    Can you also post the dbLog? It is located just below your catalog path.
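    The two checks above can be sketched as a quick scan. It is demonstrated on a sample log line so the commands are copy-pastable; on the host, point LOG at the real vertica.log under the catalog path (the path below is assumed from earlier in the thread):

```shell
# Pull WARNING/ERROR lines from vertica.log before posting them.
# Demonstrated on a sample file; on the host, set LOG to the real path,
# e.g. /data/report01/v_report01_node0002_catalog/vertica.log (assumed path).
LOG=vertica.sample.log
cat > "$LOG" <<'EOF'
2016-11-20 02:22:00.391 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: Connection closed by spread
2016-11-20 02:22:00.563 TM Moveout:0x7fa040012000 [Txn] <INFO> Begin Txn
EOF
# Keep only WARNING/ERROR level entries
grep -E '<(WARNING|ERROR)>' "$LOG"
```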

  • dbLog:

    11/20/16 02:10:45 SP_scat_receive: failed receiving header on session 13 (ret: 0 len: 0): Resource temporarily unavailable
    11/20/16 02:22:00 SP_disconnect: mbox=13, pid=6161, send_group=#node_b#N192168200226

    Vertica Log:

    2016-11-20 02:22:01.075 Spread Client:0x83d7110 [Comms] <WARNING> error SP_receive: Illegal spread was provided
    ...
    2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] ...
    2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Changing node v_report01_node0002 startup state from UP to UNSAFE
  •  

    Yeah, not too much to work with :)

     

     

     What version are you on?

    Also, can you see anything in /var/log/messages that might relate to the OOM Killer killing the Vertica process?

    What is the size of your swap space on this host?

    Can you list the error messages in Vertica from before the crash?

     

  •  

    SWAP = 50 GB

     

    /var/log/messages (for Nov 20):

    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.348685] EXT4-fs (dm-14): 5 orphan inodes deleted
    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.349129] EXT4-fs (dm-14): recovery complete
    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.364178] EXT4-fs (dm-14): mounted filesystem with ordered data mode. Opts:

    The error messages in the Vertica log before the crash are listed above in the problem description.

     

    How do you know that the OOM Killer killed the process?

     

    Thanks!

     The Linux OS will overcommit memory, and when **bleep** hits the fan it will kill the largest memory consumer.

    To find the OOM log:

    grep -i 'killed process' /var/log/messages

    or

    dmesg | egrep -i 'killed process'

     

     

    What you can do next is log your system resources:

    vmstat -SM 60 10000 > memoryusage.out &

     

     where:

    Memory
    swpd: the amount of virtual memory used.
    free: the amount of idle memory.
    buff: the amount of memory used as buffers.
    cache: the amount of memory used as cache.
    inact: the amount of inactive memory. (-a option)
    active: the amount of active memory. (-a option)

    Swap
    si: Amount of memory swapped in from disk (/s).
    so: Amount of memory swapped to disk (/s).
     

    This is just to rule out the possibility that this is a resource-caused outage.
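    A quick way to review the capture afterward is to flag any sample with swap activity: in the default vmstat layout, si (swap-in) is column 7 and so (swap-out) is column 8. Sketched on a sample capture; on the host, run it against memoryusage.out:

```shell
# Flag vmstat samples that show swap activity (si = column 7, so = column 8
# in the default layout). Demonstrated on a sample capture; on the host,
# point it at memoryusage.out instead.
cat > memoryusage.sample <<'EOF'
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  34592  55064    107 109680    0    0  1117    70    0    0  5  1 92  2  0
 1  0  34600  54012    107 109700   12    8   900    65  210  340  6  1 91  2  0
EOF
# Skip the two header lines, print any row where si or so is nonzero
awk 'NR > 2 && ($7 > 0 || $8 > 0)' memoryusage.sample
```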

     

     

  • Here is the output:

     

    [root@681380-db-srv4 log]# grep -i 'killed process' /var/log/messages
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# zgrep -i 'killed process' /var/log/messages*.gz
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# dmesg | egrep -i 'killed process'
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# vmstat -SM 60 10000 > memoryusage.out &
    [1] 46326
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# cat memoryusage.out
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    0 0 34592 55064 107 109680 0 0 1117 70 0 0 5 1 92 2 0

    I highly doubt it's a resource issue. It's a HUGE machine - 512 GB of memory, 50 GB of swap.

    I have a feeling it's a networking or configuration issue.

     

    Thanks again !
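    One way to chase the networking theory is to scan the system log for link flaps or interface errors near the 02:22 shutdown. The sample line below is fabricated purely for illustration; on the host, grep /var/log/messages and zgrep the rotated *.gz copies:

```shell
# Look for link up/down events or NIC errors around the shutdown time.
# The sample log line is fabricated for illustration only; on the host,
# set MSGS=/var/log/messages (and check rotated copies with zgrep).
MSGS=messages.sample
cat > "$MSGS" <<'EOF'
Nov 20 02:21:58 681380-db-srv4 kernel: eth0: link down
Nov 20 00:02:07 681380-db-srv4 kernel: EXT4-fs (dm-14): recovery complete
EOF
# Match link state changes or interface error/down/up messages
grep -Ei 'link (down|up)|eth[0-9].*(error|down|up)' "$MSGS"
```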

  •  

     I have exhausted my troubleshooting tools :)

     

    Let me ask you: why would you run a single-node DB?

    And why on such a big machine?

  • Hi,

     

    You say you are running Vertica on a single node, but in the installation command you shared, you passed two hosts as a parameter:

     

    --hosts db-srv4, db-srv5

     

    This might be the issue.

     

    Regards,

    Lucas L.-

  • Yes, I recently removed the db-srv5 node from the cluster. But the restarts had been happening before I removed that node.
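    One thing worth verifying after the node removal is that the cluster definition no longer lists db-srv5. The host list lives in admintools.conf (stock path /opt/vertica/config/admintools.conf; the exact layout can vary by version). Sketched here on a sample fragment:

```shell
# Check which hosts the cluster definition still references.
# Sample fragment for illustration; on the host, grep
# /opt/vertica/config/admintools.conf (stock install path; layout may vary).
CONF=admintools.conf.sample
cat > "$CONF" <<'EOF'
[Cluster]
hosts = 192.168.200.226
EOF
# Print the [Cluster] section header and its host list
grep -A1 '^\[Cluster\]' "$CONF"
```

If db-srv5 (or its IP) still appears there, the cluster definition and the actual topology disagree, which would fit the configuration-issue theory.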
