Vertica 8.1.1 OOM (constantly getting killed)
We have been plagued by OOM issues. Today two nodes failed within 15 minutes of each other, and every month at least one node fails due to OOM. Here are the details:
DB Version: 8.1.1-6 (3 nodes)
AWS AMI: Vertica 8.1.1 CentOS 7.3 - 1498566984-38d06046-9fbd-4e9e-8f59-cfdb7b6de752-ami-751f2e63.4 (ami-85ffe3fc)
OS: Centos 7.3 3.10.0-514.6.2.el7.x86_64
glibc: glibc-2.17-196.el7.x86_64
RAM:
              total        used        free      shared  buff/cache   available
Mem:            62G        4.4G         40G        491M         17G         57G
Swap:           15G        727M         15G
Resource pools:
With MEMORYSIZE set:
  sysquery: 64M, sysdata: 100M, wosdata: 2G, tm: 2G, p_dashboard (custom pool): 8G (cascades to general)
With MAXMEMORYSIZE set:
  general: 48G, sysdata: 1G, wosdata: 2G, jvm: 2G, monitoring: 2G, blobdata: 10% (not used; we don't run any machine learning).
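For completeness, these settings can be read back from the catalog; a minimal sketch, assuming the standard v_catalog.resource_pools system table (column names may differ slightly between versions):

-- List the memory settings of every resource pool.
-- Column names assumed from v_catalog.resource_pools; adjust for your version.
SELECT name, memorysize, maxmemorysize, plannedconcurrency
FROM resource_pools
ORDER BY name;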
OOM dmesg logs:
[  pid  ]   uid   tgid  total_vm       rss  nr_ptes  swapents  oom_score_adj  name
[331310]   1001 331310  25760521  15801473    42609   3996082              0  vertica
This shows vertica with 60.2 GB RSS (15801473 pages x 4 KiB) and roughly 15 GB in swap entries; no other process has even close to 1 GB RSS.
sysctl (changes from the base AMI): vm.swappiness=1
Any help would be greatly appreciated.
Comments
Hi,
Did you see the posts recommending an upgrade to Vertica 8.1.1-9 if you are on 8.1.1-6?
https://forum.vertica.com/discussion/239197/vertica-8-1-1-6
Thanks, we switched to Vertica 8.1.1-10 but promptly got OOM errors again a few days later. We have since lowered the general pool MAXMEMORYSIZE (see the sketch below). However, other issues with 8.1.1-10 forced us to switch versions again, this time to 9.0.0-2: the 8.1.1-10 system would constantly segfault on startup while performing an analyze row task, and we couldn't keep the database up.
See: https://forum.vertica.com/discussion/239313/8-1-1-10-segfault-on-startup-after-crash/
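For reference, lowering the general pool cap is a one-line change; a minimal sketch, where the '40G' value is only an illustrative placeholder, not the value we actually used:

-- Hypothetical example: cap the GENERAL pool lower so the OS and catalog have headroom.
ALTER RESOURCE POOL general MAXMEMORYSIZE '40G';

-- Verify the new cap.
SELECT name, maxmemorysize FROM resource_pools WHERE name = 'general';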
What's the size of the metadata pool? (Catalog size)
Anything particularly interesting in your workload? UDx?
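For anyone checking the same thing, the metadata pool's per-node usage can be read with something like the following, assuming the standard v_monitor.resource_pool_status table:

-- Show how much memory the metadata pool (catalog) is using on each node.
SELECT node_name, pool_name, memory_inuse_kb
FROM resource_pool_status
WHERE pool_name = 'metadata'
ORDER BY node_name;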