
Help - Vertica Database Shuts Down Every ~10 Days

I need help understanding why my Vertica database shuts down every ~10 days.

 

Misc Notes:

1 - There is no spread.conf file in my configuration.

2 - There is no spread.log file in my configuration.

3 - K-Safety is set to 0

4 - There is only one node in the cluster.

5 - The command used to install was:

     

/opt/vertica/sbin/install_vertica --hosts db-srv4, db-srv5 --rpm /root/software/vertica/vertica-7.2.1-0.x86_64.RHEL6.rpm --dba-user dbadmin --data-dir /u01_vertica/data
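
For what it's worth, notes 3 and 4 can be double-checked from vsql with queries along these lines (standard NODES and SYSTEM system tables; run as dbadmin):

vsql -c "SELECT node_name, node_state, node_address FROM nodes;"
vsql -c "SELECT designed_fault_tolerance, current_fault_tolerance FROM system;"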

 

Here's a log of the most recent shutdown.

2016-11-20 02:22:00.391 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: Connection closed by spread
2016-11-20 02:22:00.563 TM Moveout:0x7fa040012000-b00000001fc9dc [Txn] <INFO> Begin Txn: b00000001fc9dc 'Moveout: Tuple Mover'
2016-11-20 02:22:00.564 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> error SP_receive: The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
2016-11-20 02:22:00.642 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> stop: disconnecting #node_b#N192168200226 from spread daemon, Mbox=13
...
2016-11-20 02:22:01.070 HeartbeatGenerator:0x7fa0400126a0 [Comms] <WARNING> Node Heartbeat: Not proceeding with heartbeat checks. node in cluster?: 0, Heartbeat Started?: 1
2016-11-20 02:22:01.075 Spread Client:0x83d7110 [Comms] <WARNING> error SP_receive: Illegal spread was provided
...
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31c9
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ca
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cb
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cc
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cd
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31ce
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31cf
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d0
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d2
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31d3
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] <INFO> Removing 45035996273721318 from list of initialized nodes for session 681380-db-srv4.fox1-6161:0xb31fa
2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> nodeSetNotifier: node v_report01_node0002 left the cluster
2016-11-20 02:22:01.092 RebalanceCluster:0x7fa040012cd0 [Util] <INFO> Task 'RebalanceCluster' enabled
2016-11-20 02:22:01.092 LGELaggingCheck:0x7fa040014220 [Util] <INFO> Task 'LGELaggingCheck' enabled
2016-11-20 02:22:01.104 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Node left cluster, reassessing k-safety...
2016-11-20 02:22:01.121 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Setting node v_report01_node0002 to UNSAFE
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.121254 ExpirationTimestamp: 2084-12-08 05:36:08.121254 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 startup state to UNSAFE DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2016-11-20 02:22:01.169587 ExpirationTimestamp: 2016-11-20 02:22:01.169587 EventCodeDescription: Node State Change ProblemDescription: Changing node v_report01_node0002 leaving startup state UP DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Changing node v_report01_node0002 startup state from UP to UNSAFE
2016-11-20 02:22:01.184 Spread Mailbox Dequeue:0x83e1630 <LOG> @v_report01_node0002: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2016-11-20 02:22:01.184184 ExpirationTimestamp: 2016-11-20 02:32:01.184184 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=0 total number of nodes=1 DatabaseName: report01 Hostname: 681380-db-srv4.fox1.com
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: SP_stop_monitoring invoked
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <WARNING> NodeHeartbeatManager: SP_stop_monitoring failed with return code -11
2016-11-20 02:22:01.202 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> NodeHeartbeatManager: Notifying the thread waiting on health_check message before disabling HeartbeatGenerator service
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Lost membership of the DB group
2016-11-20 02:22:01.203 Spread Mailbox Dequeue:0x83e1630 [Comms] <INFO> Removing #node_b#N192168200226->v_report01_node0002 from processToNode and other maps due to departure from Vertica:all
...
2016-11-20 02:22:01.229 SafetyShutdown:0x7fa04c005400 [Shutdown] <INFO> Shutting down this node
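
Since the log points at Spread dropping the connection, I assume checks along these lines would at least confirm whether the spread daemon is still alive after the shutdown (default /opt/vertica paths assumed):

ps -ef | grep spread | grep -v grep
/opt/vertica/bin/admintools -t list_allnodes    # run as dbadmin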

Comments

  •  Hi, 

    Can you post your ErrorReport.txt file? It is located in the catalog path of your database.
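
    If you are not sure where that is, a query like this should print the catalog path for each node (catalog_path is a column of the NODES system table, if I recall correctly):

    vsql -c "SELECT node_name, catalog_path FROM nodes;"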

     

  • The file is empty.

     

    [root@681380-db-srv4 v_report01_node0002_catalog]# cat ErrorReport.txt
    [root@681380-db-srv4 v_report01_node0002_catalog]#
    [root@681380-db-srv4 v_report01_node0002_catalog]# pwd
    /data/report01/v_report01_node0002_catalog
    [root@681380-db-srv4 v_report01_node0002_catalog]#
  •     OK, I don't think it is worth going deep into troubleshooting, as you are running Vertica in a setup that is not recommended (single node).

       But for the fun of it:

    Can you look for ERROR entries in your vertica.log file and post them?

    Can you also post the dbLog? It is located below your catalog path.
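
    Something along these lines should pull the interesting bits (paths taken from your earlier output; adjust if the dbLog sits elsewhere under /data/report01):

    grep -E '<ERROR>|<FATAL>|<PANIC>' /data/report01/v_report01_node0002_catalog/vertica.log | tail -n 50
    cat /data/report01/dbLog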

  • dbLog:

    11/20/16 02:10:45 SP_scat_receive: failed receiving header on session 13 (ret: 0 len: 0): Resource temporarily unavailable
    11/20/16 02:22:00 SP_disconnect: mbox=13, pid=6161, send_group=#node_b#N192168200226

    Vertica Log:

    2016-11-20 02:22:01.075 Spread Client:0x83d7110 [Comms] <WARNING> error SP_receive: Illegal spread was provided
    ...
    2016-11-20 02:22:01.080 Spread Mailbox Dequeue:0x83e1630 [VMPI] ...
    2016-11-20 02:22:01.169 Spread Mailbox Dequeue:0x83e1630 [Recover] <INFO> Changing node v_report01_node0002 startup state from UP to UNSAFE
  •  

    Yeah, not too much to work with :)

     

     

     What version are you on?

     

    Also, can you see anything in /var/log/messages that might relate to the OOM Killer and the Vertica process?

     

    What is the size of your swap space on this host?

     

    Can you list the error_messages in Vertica from before the crash?
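
    For example, something along these lines against the ERROR_MESSAGES system table, with the time window adjusted to just before the shutdown in your log:

    vsql -c "SELECT event_timestamp, error_level, message
             FROM error_messages
             WHERE event_timestamp BETWEEN '2016-11-20 01:00:00' AND '2016-11-20 02:30:00'
             ORDER BY event_timestamp;"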

     

  •  

    SWAP = 50 GB

     

    /var/log/messages (for Nov 20):

    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.348685] EXT4-fs (dm-14): 5 orphan inodes deleted
    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.349129] EXT4-fs (dm-14): recovery complete
    Nov 20 00:02:07 681380-db-srv4 kernel: [11332901.364178] EXT4-fs (dm-14): mounted filesystem with ordered data mode. Opts:

    The error messages in the Vertica log before the crash are listed above in the problem description.

     

    How can I tell whether the OOM Killer killed the process?

     

    Thanks!

  •  The Linux OS will overcommit memory, and when things hit the fan it will kill the largest memory consumer.

     

    Find the OOM log:

    grep -i 'killed process' /var/log/messages

    or

    dmesg | egrep -i 'killed process'
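
    Kernel wording varies a bit between versions, so it may be worth casting a slightly wider net as well (same idea, just a broader pattern, including rotated logs):

    egrep -i 'oom|out of memory|killed process' /var/log/messages
    zgrep -i 'oom|out of memory|killed process' /var/log/messages*.gz
    dmesg | egrep -i 'oom|out of memory'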

     

     

    What you can do next is log your system resources:

    vmstat -SM 60 10000 > memoryusage.out &

     

     where:

    Memory
    swpd: the amount of virtual memory used.
    free: the amount of idle memory.
    buff: the amount of memory used as buffers.
    cache: the amount of memory used as cache.
    inact: the amount of inactive memory. (-a option)
    active: the amount of active memory. (-a option)

    Swap
    si: Amount of memory swapped in from disk (/s).
    so: Amount of memory swapped to disk (/s).
     

    This is just to rule out the possibility that this is a resource-caused outage.
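
    If you want per-process numbers too, a small loop like this (just a throwaway helper, not a Vertica tool) records the memory of the vertica process every minute:

    # log a timestamp plus memory figures for any process named 'vertica' once a minute
    while true; do
        date >> vertica_mem.out
        ps -o pid,rss,vsz,pmem,args -C vertica >> vertica_mem.out
        sleep 60
    done &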

     

     

  • Here is the output:

     

    [root@681380-db-srv4 log]# grep -i 'killed process' /var/log/messages
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# zgrep -i 'killed process' /var/log/messages*.gz
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# dmesg | egrep -i 'killed process'
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# vmstat -SM 60 10000 > memoryusage.out &
    [1] 46326
    [root@681380-db-srv4 log]#
    [root@681380-db-srv4 log]# cat memoryusage.out
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    0 0 34592 55064 107 109680 0 0 1117 70 0 0 5 1 92 2 0

    I highly doubt it's a resource issue. It's a HUGE machine: 512 GB of memory, 50 GB of swap.

    I have a feeling it's a networking or configuration issue.
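
    If it is networking, would checks along these lines be the right direction? (I understand Spread uses port 4803 by default; eth0 is just a placeholder for my interface.)

    netstat -anp | grep 4803
    iptables -L -n
    ip -s link show eth0    # look for RX/TX errors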

     

    Thanks again !

  •  

     I have exhausted my troubleshooting tools :)

     

    Let me ask you: why would you have a single-node DB?

    And why run it on such a big machine?

  • Hi,

     

    You are saying that you are running Vertica on a single node, but in the installation command you shared, you passed two hosts as a parameter.

     

    --hosts db-srv4, db-srv5

     

    This might be the issue.
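
    One thing worth checking is which hosts admintools still has registered for the cluster, for example (assuming the default install location):

    /opt/vertica/bin/admintools -t list_allnodes
    grep -i hosts /opt/vertica/config/admintools.conf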

     

    Regards,

    Lucas L.-

  • Yes, I recently removed the db-srv5 node from the cluster, but the restarts had been happening before I removed that node.
