How to find out what is killing the Vertica process
This information was provided by Hibiki Serizawa
Vertica handles the SIGSEGV signal (segmentation fault) by logging information to vertica.log, ErrorReport.txt, and the core dump files. If Vertica is killed with the SIGKILL signal, for example by kill -9, nothing is written to any log file, on either the Vertica side or the system side. In some cases, nothing is logged even when Vertica receives the SIGSEGV signal. In these cases, you need to capture the signal information on the system side.
There are two ways to capture this information: auditd and SystemTap.
auditd (Linux Auditing System)
The audit daemon writes audit records to disk. It can capture system audit events, including system calls and watches on file system objects. Capturing specific system calls requires configuring auditd. Abnormal process termination, however, is itself a system audit event, so auditd records it without additional configuration.
To enable auditd to capture the kill system calls, run the following commands as root after starting the auditd daemon:
$ auditctl -a exit,always -F arch=b64 -S kill -k audit_kill
$ auditctl -a exit,always -F arch=b64 -S tkill -k audit_kill
$ auditctl -a exit,always -F arch=b64 -S tgkill -k audit_kill
These rules are temporary and last only until auditd is restarted or the system is rebooted. To make them permanent, add the following lines to /etc/audit/audit.rules and restart auditd:
-a exit,always -F arch=b64 -S kill -k audit_kill
-a exit,always -F arch=b64 -S tkill -k audit_kill
-a exit,always -F arch=b64 -S tgkill -k audit_kill
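As a rough sketch of making the rules persistent without duplicating them on repeated runs, the helper below appends a rule to a rules file only if that exact line is not already there. The function name add_audit_rule and the RULES_FILE variable are illustrative and not part of auditd. The default path is a scratch file so the sketch is safe to try; point it at your real rules file once you are satisfied. On some distributions the rules are kept under /etc/audit/rules.d/ and merged by augenrules, so the right path depends on your system.

```shell
#!/bin/sh
# Hypothetical helper: append an audit rule to a rules file only once.
add_audit_rule() {
    rules_file=$1
    rule=$2
    # -x matches the whole line, -F treats the rule as a fixed string,
    # and -e stops grep from reading the leading "-a" as an option.
    if ! grep -qxF -e "$rule" "$rules_file" 2>/dev/null; then
        printf '%s\n' "$rule" >> "$rules_file"
    fi
}

# Assumed location: a scratch file, not the live auditd configuration.
RULES_FILE=${RULES_FILE:-${TMPDIR:-/tmp}/audit_kill.rules}

for syscall in kill tkill tgkill; do
    add_audit_rule "$RULES_FILE" "-a exit,always -F arch=b64 -S $syscall -k audit_kill"
done
```

Running the loop a second time leaves the file unchanged, which makes it suitable for provisioning scripts that may be re-run.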
Run the following command to list the current rules:
$ auditctl -l
-a always,exit -F arch=b64 -S kill -F key=audit_kill
-a always,exit -F arch=b64 -S tkill -F key=audit_kill
-a always,exit -F arch=b64 -S tgkill -F key=audit_kill
In this example, 'audit_kill' is specified as the value of the -k option. This is the filter key that you can use to search the audit records later.
Try to kill the Vertica process after enabling auditd to capture the kill system call:
$ killall -9 vertica
$ tail -f /var/log/audit/audit.log
type=SYSCALL msg=audit(1592901047.912:529): arch=c000003e syscall=62 success=yes exit=0 a0=f66 a1=9 a2=0 a3=22d items=0 ppid=4172 pid=4173 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1 comm="killall" exe="/usr/bin/killall" key="audit_kill"
type=OBJ_PID msg=audit(1592901047.912:529): opid=3942 oauid=1000 ouid=1001 oses=1 ocomm="vertica"
type=PROCTITLE msg=audit(1592901047.912:529): proctitle=6B696C6C616C6C002D390076657274696361
Search the audit records by using the filter key:
$ ausearch -i -k audit_kill
----
type=PROCTITLE msg=audit(06/23/2020 17:30:47.912:529) : proctitle=killall -9 vertica
type=OBJ_PID msg=audit(06/23/2020 17:30:47.912:529) : opid=3942 oauid=duser ouid=dbadmin oses=1 ocomm=vertica
type=SYSCALL msg=audit(06/23/2020 17:30:47.912:529) : arch=x86_64 syscall=kill success=yes exit=0 a0=0xf66 a1=SIGKILL a2=0x0 a3=0x22d items=0 ppid=4172 pid=4173 auid=duser uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts0 ses=1 comm=killall exe=/usr/bin/killall key=audit_kill
This output shows that /usr/bin/killall successfully sent the SIGKILL signal to the Vertica process with PID 3942.
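The raw records in audit.log store some values in hexadecimal. In the example above, a0 is the first argument of the kill system call (the target PID), and proctitle is the command line with NUL bytes between its arguments. ausearch -i already performs this translation; the following sketch only illustrates the encoding. The helper names hex_to_dec and decode_proctitle are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers for reading raw audit.log values by hand.

# Convert a hexadecimal audit argument (for example a0=f66) to decimal.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# Decode a hex-encoded proctitle; NUL separators become spaces.
decode_proctitle() {
    hex=$1
    out=""
    while [ ${#hex} -ge 2 ]; do
        rest=${hex#??}          # everything after the first two digits
        byte=${hex%"$rest"}     # the first two hex digits
        hex=$rest
        if [ "$byte" = "00" ]; then
            out="$out "
        else
            # Convert the byte to octal and let printf %b emit the character.
            out="$out$(printf '%b' "\\0$(printf '%o' "0x$byte")")"
        fi
    done
    printf '%s\n' "$out"
}

hex_to_dec f66
decode_proctitle 6B696C6C616C6C002D390076657274696361
```

With the values from the sample record, this prints 3942 and killall -9 vertica, matching the opid and proctitle fields in the interpreted output.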
With the default auditd configuration, the log file is rotated. To search the records in a rotated file, use the -if (--input) option:
$ ausearch -i -k audit_kill -if /var/log/audit/audit.log.1
To search audit records for the process that ended abnormally, run the following command:
$ ausearch -i -m ANOM_ABEND
----
type=ANOM_ABEND msg=audit(06/22/2020 18:26:21.915:8331836) : auid=dbadmin uid=dbadmin gid=verticadba ses=458 pid=32595 comm=rdk:main exe=/opt/vertica/bin/vertica sig=SIGSEGV res=yes
This output shows that the Vertica process was ended by the SIGSEGV signal and that the segmentation fault is associated with rdk:main.
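When there are many records, it can help to pull out only the fields of interest. The following is a minimal sketch, assuming interpreted (ausearch -i) output in which each field is a space-separated name=value pair; audit_field is a hypothetical helper, not an audit tool, and it does not handle values that contain spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one name=value field from
# interpreted audit records read on standard input.
audit_field() {
    # The leading space keeps "pid" from matching "ppid" or "opid".
    sed -n "s/.* $1=\([^ ]*\).*/\1/p"
}

# Sample record copied from the ausearch output above.
record='type=ANOM_ABEND msg=audit(06/22/2020 18:26:21.915:8331836) : auid=dbadmin uid=dbadmin gid=verticadba ses=458 pid=32595 comm=rdk:main exe=/opt/vertica/bin/vertica sig=SIGSEGV res=yes'

for field in pid comm exe sig; do
    printf '%s=%s\n' "$field" "$(printf '%s\n' "$record" | audit_field "$field")"
done
```

On a live system, the same helper could be placed at the end of a pipeline, for example ausearch -i -m ANOM_ABEND | audit_field sig, to list only the signals involved.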
In most cases, auditd is installed by default. If not, install the audit package for RHEL/CentOS/SUSE or the auditd package for Debian/Ubuntu.
SystemTap
SystemTap is a tracing and probing tool that allows you to monitor the activities of the operating system by simply running user-written SystemTap scripts.
Install the following packages:
• systemtap
• systemtap-runtime
Additionally, you need to install the following packages to get information about the kernel. These packages must be for your kernel version.
• kernel-debuginfo
• kernel-debuginfo-common
• kernel-devel
To determine what kernel version your system is currently using, run the following command:
$ uname -r
3.10.0-1062.12.1.el7.x86_64
For example, if your kernel version is 3.10.0-1062.12.1.el7.x86_64, then you need to install the following RPMs:
kernel-debuginfo-3.10.0-1062.12.1.el7.x86_64.rpm
kernel-debuginfo-common-x86_64-3.10.0-1062.12.1.el7.x86_64.rpm
kernel-devel-3.10.0-1062.12.1.el7.x86_64.rpm
If your kernel version is not the latest, these RPMs may no longer be in the repository, and you may need to download and install them manually. The following RPMs are for CentOS 7.7:
http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-1062.12.1.el7.x86_64.rpm
http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-1062.12.1.el7.x86_64.rpm
http://vault.centos.org/7.7.1908/updates/x86_64/Packages/kernel-devel-3.10.0-1062.12.1.el7.x86_64.rpm
For Ubuntu, refer to https://wiki.ubuntu.com/Kernel/Systemtap.
For Debian, refer to https://wiki.debian.org/DebugPackage.
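For RHEL/CentOS, the RPM names listed earlier follow a fixed pattern built from the kernel release string. As a small sketch of that pattern, the hypothetical helper stap_kernel_packages below prints the three package names for a given release; it assumes the architecture is the text after the last dot of the release string, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: list the kernel packages SystemTap needs for a
# given kernel release string (as printed by `uname -r`).
stap_kernel_packages() {
    kver=$1
    arch=${kver##*.}    # text after the last dot, for example x86_64
    printf '%s\n' \
        "kernel-debuginfo-$kver" \
        "kernel-debuginfo-common-$arch-$kver" \
        "kernel-devel-$kver"
}

# Print the package names for the running kernel.
stap_kernel_packages "$(uname -r)"
```

Comparing this list with the output of rpm -q for each name is a quick way to confirm that the installed packages match the running kernel before starting SystemTap.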
Create a SystemTap script like the following to capture the signals sent to a specific process:
#! /usr/bin/env stap
#
# sigmon.stp for capturing the signals.
#
probe begin
{
  printf("%-8s %-16s %-5s %-16s %6s %-16s\n",
         "SPID", "SNAME", "RPID", "RNAME", "SIGNUM", "SIGNAME")
}

probe signal.send
{
  if (sig_pid == target())
    printf("%-8d %-16s %-5d %-16s %-6d %-16s\n",
           pid(), execname(), sig_pid, pid_name, sig, sig_name)
}
Save the script as /tmp/sigmon.stp and run it in the background, passing the Vertica process ID with the -x option:
$ nohup stap -x `pgrep -o vertica` /tmp/sigmon.stp > /tmp/sigmon.log 2>&1 &
Try to kill the Vertica process after starting the SystemTap script:
$ killall -9 vertica
$ tail -f /tmp/sigmon.log
SPID     SNAME            RPID  RNAME            SIGNUM SIGNAME
9707     killall          6555  vertica          9      SIGKILL
This output shows that killall sent the SIGKILL signal to the Vertica process with PID 6555.
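Because sigmon.stp records every signal sent to the target, the log can grow noisy over time. The sketch below filters it down to signals that typically terminate a process. The helper name fatal_signals and the chosen signal list are assumptions for illustration, and the column positions assume process names without spaces.

```shell
#!/bin/sh
# Hypothetical helper: report only the signals that typically end a process.
# Expected columns: SPID SNAME RPID RNAME SIGNUM SIGNAME
fatal_signals() {
    awk '$6 == "SIGKILL" || $6 == "SIGSEGV" || $6 == "SIGTERM" || $6 == "SIGABRT" {
        printf "%s (pid %s) sent %s to %s (pid %s)\n", $2, $1, $6, $4, $3
    }' "$@"
}

# Sample input: the SIGKILL row is copied from the output above; the SIGCONT
# row is made up to show a non-fatal signal being skipped.
fatal_signals <<'EOF'
SPID     SNAME            RPID  RNAME            SIGNUM SIGNAME
9707     killall          6555  vertica          9      SIGKILL
9800     bash             6555  vertica          18     SIGCONT
EOF
```

To use it against the real log, pass the file name instead of standard input: fatal_signals /tmp/sigmon.log.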
Comments
@Amaksh , these are very detailed useful information for all. Are such information available/stored in Vertica Document repository anywhere for later reference?
@Sankarmn , thank you for your feedback. We will try to find a more appropriate place to post the troubleshooting tips.