How to know what is killing Vertica process

This information was provided by Hibiki Serizawa

Vertica handles the SIGSEGV signal (segmentation fault) well by logging information to vertica.log, ErrorReport.txt, and the core dump files. If Vertica is killed with the SIGKILL signal, for example by kill -9, no information is written to any log file, on either the Vertica side or the system side. In some cases, nothing is logged even when Vertica receives a SIGSEGV signal. In these cases, you need to capture the signal information on the system side.

The following are two ways to capture this information: auditd and SystemTap.

auditd (Linux Auditing System)
The audit daemon writes audit records to disk. It captures system audit events, including system calls and watches on file system objects. To capture specific system calls, you need to configure auditd. Without any additional configuration, auditd already records any process that ends abnormally, because that event is one of the system audit events.

To enable auditd to capture the kill, tkill, and tgkill system calls, run the following commands as root after starting the auditd daemon:

$ auditctl -a exit,always -F arch=b64 -S kill -k audit_kill
$ auditctl -a exit,always -F arch=b64 -S tkill -k audit_kill
$ auditctl -a exit,always -F arch=b64 -S tgkill -k audit_kill

These rules are temporary and last only until auditd is restarted or the system is rebooted. To make them permanent, add the following lines to /etc/audit/audit.rules and restart auditd:

-a exit,always -F arch=b64 -S kill -k audit_kill
-a exit,always -F arch=b64 -S tkill -k audit_kill
-a exit,always -F arch=b64 -S tgkill -k audit_kill
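
As a rough sketch of the permanent configuration step, the helper below appends the three rules to a rules file only when they are not already present, so it can be rerun safely. The function name add_kill_audit_rules and the choice to pass the file path as an argument are illustrative assumptions, not part of auditd. It only edits the file, so auditd still has to be restarted afterwards. On systems that generate audit.rules from files under /etc/audit/rules.d/, the path may need to be adjusted.

```shell
# Hypothetical helper: append the kill/tkill/tgkill rules to a rules file,
# skipping any rule line that is already there.
add_kill_audit_rules() {
  rules_file=$1
  for sc in kill tkill tgkill; do
    rule="-a exit,always -F arch=b64 -S $sc -k audit_kill"
    # -x matches the whole line, -F disables regex interpretation,
    # and -e keeps grep from reading the leading "-a" as an option.
    grep -qxF -e "$rule" "$rules_file" 2>/dev/null ||
      printf '%s\n' "$rule" >> "$rules_file"
  done
}

# Example (as root): add_kill_audit_rules /etc/audit/audit.rules
```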

Run the following command to list the current configuration:

$ auditctl -l
-a always,exit -F arch=b64 -S kill -F key=audit_kill
-a always,exit -F arch=b64 -S tkill -F key=audit_kill
-a always,exit -F arch=b64 -S tgkill -F key=audit_kill

In this example, 'audit_kill' is specified as the value of the -k option. This is the filter key that you can use to search the audit records later.

Try to kill the Vertica process after enabling auditd to capture the kill system call:

$ killall -9 vertica

$ tail -f /var/log/audit/audit.log
type=SYSCALL msg=audit(1592901047.912:529): arch=c000003e syscall=62 success=yes exit=0 a0=f66 a1=9 a2=0 a3=22d items=0 ppid=4172 pid=4173 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1 comm="killall" exe="/usr/bin/killall" key="audit_kill"
type=OBJ_PID msg=audit(1592901047.912:529): opid=3942 oauid=1000 ouid=1001 oses=1 ocomm="vertica"
type=PROCTITLE msg=audit(1592901047.912:529): proctitle=6B696C6C616C6C002D390076657274696361
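
The proctitle value in the raw log is the sender's command line encoded as hexadecimal, with a NUL byte (00) between arguments. ausearch -i decodes it for you, but when only the raw audit.log line is at hand, a small function like the following can decode it. This is a minimal sketch: decode_proctitle is a hypothetical name, and it assumes the value is plain hexadecimal with an even number of digits.

```shell
# Hypothetical helper: decode a hex-encoded proctitle value from audit.log.
# NUL separators between arguments are printed as spaces.
decode_proctitle() {
  hex=$1
  out=
  # Refuse input that cannot be split into two-digit bytes.
  [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
  while [ -n "$hex" ]; do
    byte=${hex%"${hex#??}"}   # first two hex digits
    hex=${hex#??}             # remainder of the string
    if [ "$byte" = "00" ]; then
      out="$out "
    else
      # Convert the hex byte to octal, then let printf emit that character.
      out="$out$(printf "\\$(printf '%03o' "0x$byte")")"
    fi
  done
  printf '%s\n' "$out"
}

decode_proctitle 6B696C6C616C6C002D390076657274696361
```

For the record above, this prints the same command line that ausearch -i reports: killall -9 vertica.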

Search the audit records by using the filter key:

$ ausearch -i -k audit_kill
----
type=PROCTITLE msg=audit(06/23/2020 17:30:47.912:529) : proctitle=killall -9 vertica
type=OBJ_PID msg=audit(06/23/2020 17:30:47.912:529) : opid=3942 oauid=duser ouid=dbadmin oses=1 ocomm=vertica
type=SYSCALL msg=audit(06/23/2020 17:30:47.912:529) : arch=x86_64 syscall=kill success=yes exit=0 a0=0xf66 a1=SIGKILL a2=0x0 a3=0x22d items=0 ppid=4172 pid=4173 auid=duser uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts0 ses=1 comm=killall exe=/usr/bin/killall key=audit_kill

This output enables you to figure out that /usr/bin/killall successfully sent the SIGKILL signal to the Vertica process with PID 3942.
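
In the raw audit.log record, the system call arguments are hexadecimal: for the kill system call, a0 is the target PID and a1 is the signal number. The sketch below converts them to decimal so that they can be compared with opid and the signal list. The function name kill_args_from_record is hypothetical. It assumes a record for kill with a positive PID, because tkill and tgkill take their arguments in a different order.

```shell
# Hypothetical helper: print "target_pid signal_number" in decimal from one
# raw type=SYSCALL record for the kill system call.
kill_args_from_record() {
  # Pull out the hex values that follow " a0=" and " a1=".
  a0=$(printf '%s\n' "$1" | sed -n 's/.* a0=\([0-9a-fA-F]*\) .*/\1/p')
  a1=$(printf '%s\n' "$1" | sed -n 's/.* a1=\([0-9a-fA-F]*\) .*/\1/p')
  [ -n "$a0" ] && [ -n "$a1" ] || return 1
  # printf accepts 0x-prefixed integers and prints them in decimal.
  printf '%d %d\n' "0x$a0" "0x$a1"
}

record='type=SYSCALL msg=audit(1592901047.912:529): arch=c000003e syscall=62 success=yes exit=0 a0=f66 a1=9 a2=0 a3=22d items=0 ppid=4172 pid=4173'
kill_args_from_record "$record"
```

For the sample record, a0=f66 is PID 3942 and a1=9 is SIGKILL, which matches the interpreted ausearch output.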

With the default configuration for auditd, the log file is rotated. If you want to search the records in a rotated file, use the -if/--input option as follows:

$ ausearch -i -k audit_kill -if /var/log/audit/audit.log.1

To search audit records for the process that ended abnormally, run the following command:

$ ausearch -i -m ANOM_ABEND
----
type=ANOM_ABEND msg=audit(06/22/2020 18:26:21.915:8331836) : auid=dbadmin uid=dbadmin gid=verticadba ses=458 pid=32595 comm=rdk:main exe=/opt/vertica/bin/vertica sig=SIGSEGV res=yes

This output enables you to figure out that the Vertica process was ended by the SIGSEGV signal and that the segmentation fault is associated with rdk:main.
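
When many such records have to be checked, it can help to pull out single fields such as sig, comm, or pid. The following is a minimal sketch with a hypothetical function name, audit_field. It splits the record on spaces, so it only suits fields whose values contain no spaces (it would truncate an interpreted proctitle, for example).

```shell
# Hypothetical helper: print the value of one name=value field from a
# single audit record. $1 = field name, $2 = record text.
audit_field() {
  # Put each space-separated token on its own line, then keep the first
  # token that starts with "<name>=" and strip that prefix.
  printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

record='type=ANOM_ABEND msg=audit(06/22/2020 18:26:21.915:8331836) : auid=dbadmin uid=dbadmin gid=verticadba ses=458 pid=32595 comm=rdk:main exe=/opt/vertica/bin/vertica sig=SIGSEGV res=yes'
audit_field sig "$record"
audit_field comm "$record"
```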

In most cases, auditd is installed by default. If not, install the audit package for RHEL/CentOS/SUSE or the auditd package for Debian/Ubuntu.

SystemTap

SystemTap is a tracing and probing tool that allows you to monitor the activities of the operating system by simply running user-written SystemTap scripts.

Install the following packages:

• systemtap
• systemtap-runtime

Additionally, you need to install the following packages to get information about the kernel. These packages must be for your kernel version.

• kernel-debuginfo
• kernel-debuginfo-common
• kernel-devel

To determine what kernel version your system is currently using, run the following command:

$ uname -r
3.10.0-1062.12.1.el7.x86_64

For example, if your kernel version is 3.10.0-1062.12.1.el7.x86_64, then you need to install the following RPMs:

kernel-debuginfo-3.10.0-1062.12.1.el7.x86_64.rpm
kernel-debuginfo-common-x86_64-3.10.0-1062.12.1.el7.x86_64.rpm
kernel-devel-3.10.0-1062.12.1.el7.x86_64.rpm
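
The file names above follow directly from the output of uname -r. As a small sketch, the function below builds the three names for any kernel release. The name kernel_debug_rpms is hypothetical, and the naming pattern is assumed from the RHEL/CentOS 7 example here, so other distributions may differ.

```shell
# Hypothetical helper: print the three RPM file names for a kernel release.
# $1 = kernel release (default: uname -r), $2 = architecture (default: uname -m)
kernel_debug_rpms() {
  kver=${1:-$(uname -r)}
  arch=${2:-$(uname -m)}
  printf '%s\n' \
    "kernel-debuginfo-$kver.rpm" \
    "kernel-debuginfo-common-$arch-$kver.rpm" \
    "kernel-devel-$kver.rpm"
}

kernel_debug_rpms 3.10.0-1062.12.1.el7.x86_64 x86_64
```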

If the kernel version is not the latest, these RPMs may not exist in the repository and you may need to download and install them manually. The following RPMs are for CentOS 7.7:

http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-1062.12.1.el7.x86_64.rpm
http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-1062.12.1.el7.x86_64.rpm
http://vault.centos.org/7.7.1908/updates/x86_64/Packages/kernel-devel-3.10.0-1062.12.1.el7.x86_64.rpm

For Ubuntu, refer to https://wiki.ubuntu.com/Kernel/Systemtap.
For Debian, refer to https://wiki.debian.org/DebugPackage.

Create a SystemTap script like the following to capture the signals sent to a specific process:

#!/usr/bin/env stap

#
# sigmon.stp for capturing the signals.
#

# Print the column headers once at startup.
probe begin
{
  printf("%-8s %-16s %-5s %-16s %-6s %-16s\n",
         "SPID", "SNAME", "RPID", "RNAME", "SIGNUM", "SIGNAME")
}

# Report each signal sent to the target process (the PID given with stap -x).
probe signal.send
{
  if (sig_pid == target())
    printf("%-8d %-16s %-5d %-16s %-6d %-16s\n",
           pid(), execname(), sig_pid, pid_name, sig, sig_name)
}

Save the script as /tmp/sigmon.stp, then run SystemTap with it in the background, providing the Vertica process ID:

$ nohup stap -x `pgrep -o vertica` /tmp/sigmon.stp > /tmp/sigmon.log 2>&1 &

Try to kill the Vertica process after starting the SystemTap script:

$ killall -9 vertica

$ tail -f /tmp/sigmon.log
SPID     SNAME            RPID  RNAME            SIGNUM SIGNAME
9707     killall          6555  vertica          9      SIGKILL

This output enables you to figure out that killall sent the SIGKILL signal to the Vertica process with PID 6555.
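
If the log grows over time, a short filter can list only the senders of one signal. The following is a minimal sketch with a hypothetical function name, signal_senders. It assumes the six-column layout printed by sigmon.stp and that the process names contain no spaces. Lines that do not start with a numeric sender PID, such as the header, are skipped.

```shell
# Hypothetical helper: list "sender_pid sender_name" for one signal name.
# $1 = signal name (for example SIGKILL), $2 = log file written by sigmon.stp
signal_senders() {
  awk -v sig="$1" '$1 ~ /^[0-9]+$/ && $6 == sig { print $1, $2 }' "$2"
}

# Example: signal_senders SIGKILL /tmp/sigmon.log
```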

Comments

  • @Amaksh , these are very detailed useful information for all. Are such information available/stored in Vertica Document repository anywhere for later reference?

  • Hibiki (Vertica Employee)

    @Sankarmn , thank you for your feedback. We will try to find a more appropriate place to post the troubleshooting tips.
