Vertica 8.1.1 OOM (constantly getting killed)

We have been plagued by OOM issues. Today two nodes failed within 15 minutes of each other, and every month at least one node fails due to OOM. Here are the details:

DB Version: 8.1.1-6 (3 nodes)
AWS AMI: Vertica 8.1.1 CentOS 7.3 - 1498566984-38d06046-9fbd-4e9e-8f59-cfdb7b6de752-ami-751f2e63.4 (ami-85ffe3fc)
OS: CentOS 7.3 3.10.0-514.6.2.el7.x86_64
glibc: glibc-2.17-196.el7.x86_64

RAM:
        total   used    free    shared  buff/cache  available
Mem:    62G     4.4G    40G     491M    17G         57G
Swap:   15G     727M    15G

Resource pools:

w/memorysize set:
sysquery: 64M, sysdata: 100M, wosdata: 2G, tm: 2G, p_dashboard (custom pool): 8G (cascades to general)

w/maxmemorysize set:
general: 48G, sysdata: 1G, wosdata: 2G, jvm: 2G, monitoring: 2G, blobdata: 10% (not used; we don't run any machine learning)

OOM dmesg logs:
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[331310] 1001 331310 25760521 15801473 42609 3996082 0 vertica

This shows the vertica process with about 60.2GB rss and about 15GB in swap (swapents), counting each page as 4 KiB; no other process comes close to even 1GB rss.
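
For anyone checking the arithmetic: the rss and swapents columns in the OOM table are page counts, not bytes. Here is a minimal sketch of the conversion, assuming 4 KiB pages (confirm with `getconf PAGESIZE`). The helper names `pages_to_gib` and `summarize_oom_row` are made up for illustration and are not part of any Vertica or kernel tooling.

```python
import re

# Assumption: 4 KiB pages, the usual x86_64 default (verify with `getconf PAGESIZE`).
PAGE_SIZE_BYTES = 4096

# Matches one process row of the OOM killer table:
# [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
OOM_ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+(?P<total_vm>\d+)\s+"
    r"(?P<rss>\d+)\s+(?P<nr_ptes>\d+)\s+(?P<swapents>\d+)\s+"
    r"(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)"
)


def pages_to_gib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


def summarize_oom_row(line):
    """Parse one OOM table row and report its memory columns in GiB."""
    match = OOM_ROW.search(line)
    if match is None:
        raise ValueError("not an OOM process row: %r" % line)
    return {
        "name": match.group("name"),
        "pid": int(match.group("pid")),
        "total_vm_gib": pages_to_gib(int(match.group("total_vm"))),
        "rss_gib": pages_to_gib(int(match.group("rss"))),
        "swap_gib": pages_to_gib(int(match.group("swapents"))),
    }


# The row from the dmesg output above.
SAMPLE = "[331310] 1001 331310 25760521 15801473 42609 3996082 0 vertica"

if __name__ == "__main__":
    info = summarize_oom_row(SAMPLE)
    print("%s (pid %d): rss %.2f GiB, swap %.2f GiB, total_vm %.2f GiB" % (
        info["name"], info["pid"],
        info["rss_gib"], info["swap_gib"], info["total_vm_gib"],
    ))
```

Under the 4 KiB assumption this works out to roughly 60.3 GiB rss and 15.2 GiB of swap entries for the vertica process, which is far above the 48G maxmemorysize set on the general pool.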

sysctl (changes from the base AMI): vm.swappiness=1

Any help would be greatly appreciated.

