Vertica and OOM

Hi, I have a Vertica 6.1.2 cluster of 7 nodes with 128GB of memory each. For about a month now (without any Vertica update in the meantime), the OOM killer regularly decides that Vertica on a host must be killed. This does not happen on all hosts, but most hosts are affected at one point or another. I know that I could play with oom_score_adj, but that is per process, so it needs to be adjusted after each restart and is just a dirty workaround; I am trying to find a better solution. Vertica has been restarted on all hosts, so this is not an old memory leak. I had set up the general pool to only use 75% of the memory (before the restart), hoping it would alleviate the issue, but it did not help. The relevant log lines from the OOM killer are below. From them I understand that swap is full (which hardly matters, since swap is tiny compared to the memory and basically useless here) and that the 'Normal' zone is below its min watermark, which is why the OOM killer kicks in. I wonder if this has already been seen, and what a solution could be. Thanks for any help,
  warning kernel: active_anon:15562290 inactive_anon:745732 isolated_anon:0
  warning kernel: active_file:1348456 inactive_file:1562279 isolated_file:0
  warning kernel: unevictable:0 dirty:126 writeback:0 unstable:0
  warning kernel: free:13454989 slab_reclaimable:110377 slab_unreclaimable:20001
  warning kernel: mapped:1085 shmem:21 pagetables:34068 bounce:0
  warning kernel: Node 0 DMA free:15592kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15180kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  warning kernel: lowmem_reserve[]: 0 1943 64563 64563
  warning kernel: Node 0 DMA32 free:251824kB min:1352kB low:1688kB high:2028kB active_anon:266144kB inactive_anon:389756kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1989640kB mlocked:0kB dirty:20kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:40kB slab_unreclaimable:216kB kernel_stack:0kB pagetables:276kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  warning kernel: lowmem_reserve[]: 0 0 62620 62620
  warning kernel: Node 0 Normal free:43652kB min:43668kB low:54584kB high:65500kB active_anon:61808336kB inactive_anon:2575340kB active_file:1908kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:64122880kB mlocked:0kB dirty:276kB writeback:0kB mapped:496kB shmem:16kB slab_reclaimable:18360kB slab_unreclaimable:53884kB kernel_stack:4712kB pagetables:122172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:6184 all_unreclaimable? no
  warning kernel: lowmem_reserve[]: 0 0 0 0
  warning kernel: Node 0 DMA: 2*4kB 0*8kB 2*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15592kB
  warning kernel: Node 0 DMA32: 37*4kB 16*8kB 22*16kB 16*32kB 9*64kB 158*128kB 246*256kB 156*512kB 81*1024kB 2*2048kB 0*4096kB = 251828kB
  warning kernel: Node 0 Normal: 2164*4kB 881*8kB 396*16kB 206*32kB 117*64kB 39*128kB 14*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44696kB
  warning kernel: 2912367 total pagecache pages
  warning kernel: 1629 pages in swap cache
  warning kernel: Swap cache stats: add 210179, delete 208550, find 444184/451180
  warning kernel: Free swap  = 0kB
  warning kernel: Total swap = 524280kB
  info kernel: 33554416 pages RAM
  info kernel: 523103 pages reserved
  info kernel: 2908628 pages shared
  info kernel: 16663531 pages non-shared
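For reference, the trigger visible in this dump is the Normal zone sitting just under its watermark (free:43652kB vs min:43668kB) with swap already exhausted. The same numbers can be watched live in /proc/zoneinfo before the killer fires. Here is a minimal sketch of such a check (my own script, nothing Vertica-specific; note that /proc/zoneinfo reports 4 KiB pages, while the kernel log above shows kB):

    #!/usr/bin/env python
    """Quick check of per-zone free pages vs. the 'min' watermark, the
    condition visible in the OOM dump above (Node 0 Normal free < min).

    A minimal sketch: it only assumes the standard /proc/zoneinfo layout
    and a 4 KiB page size (x86_64)."""

    import re

    PAGE_KB = 4  # page size in KiB

    def zone_watermarks(path="/proc/zoneinfo"):
        zones = []
        current = None
        with open(path) as f:
            for line in f:
                m = re.match(r"Node (\d+), zone\s+(\S+)", line)
                if m:
                    current = {"node": int(m.group(1)), "zone": m.group(2)}
                    zones.append(current)
                elif current is not None:
                    # 'pages free', 'min', 'low', 'high' lines under each zone
                    m = re.match(r"\s*(?:pages )?(free|min|low|high)\s+(\d+)", line)
                    if m:
                        current[m.group(1)] = int(m.group(2))
        return zones

    if __name__ == "__main__":
        for z in zone_watermarks():
            free_kb = z.get("free", 0) * PAGE_KB
            min_kb = z.get("min", 0) * PAGE_KB
            flag = "  <-- below min watermark" if free_kb < min_kb else ""
            print("Node %d %-7s free:%dkB min:%dkB%s"
                  % (z["node"], z["zone"], free_kb, min_kb, flag))

Running this in a loop (or from cron) on an affected node shows whether the Normal zone regularly hovers near its min watermark; if it does, raising vm.min_free_kbytes so reclaim starts earlier, or shrinking the general pool further, are the kind of knobs that may be worth trying.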

Comments

  • Hi Guillaume, I don't immediately know the answer here, but could you post one more bit of information -- how much memory is the Vertica process using on your system? (And how much memory are the other processes on your system using, specifically when the OOM kill occurs? A small snapshotting sketch for this is appended after these comments.) Also, does it look like there's a leak? Does Vertica's memory usage grow (and keep growing past the resource-pool limit) when some particular operation is performed? (If you're running the Management Console, this should be nicely graphed for you. You can probably also pull the information out of Vertica's Data Collector tables -- see the documentation and the "data_collector" system table.) The most common cause of this issue is some little side process on the cluster, one that you know for sure can't be using that much memory, suddenly trying for a moment to use that much memory :-) Or, relatedly, something running on your cluster that you weren't aware of or hadn't previously thought of as significant. If the issue is another process, alternatives to oom_score_adj include setting a memory ulimit on the other process. (Or, where it's an option, it's always better to tell processes to limit their own memory usage, as you're doing here with Vertica's general pool.) (If, on the other hand, you find that Vertica is in fact singlehandedly using all RAM and swap, that would surprise me, but please do file a support case with as many details as you have so we can track down what's going on.) Adam
  • Hi Adam, thanks for looking into this! Vertica never seems to use all the available memory. We have 128GB per node, and I would say that when an OOM occurs there is less than 100GB in use. I do not think there is a leak: we have monitoring in place (Graphite) and I do not see a steady growth over time, and furthermore all nodes have been restarted recently. I tend to agree with your analysis that an unexpected process grabbing a lot of memory would make sense. That said, the OOM output includes per-process info which shows Vertica as being by far the highest memory consumer. Swap is fully used, by Vertica only as far as I can tell, but our swap is only about 500MB, which is nothing compared to the actual memory. Thanks for your help, Guillaume
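To follow up on Adam's question about what the other processes are using at the exact moment of the kill, one low-tech option is to snapshot per-process RSS every few seconds and match the last snapshot against the OOM timestamp in syslog. The sketch below only reads /proc/<pid>/status; the output path, interval, and top-N count are arbitrary choices, not anything Vertica-specific:

    #!/usr/bin/env python
    """Log the top memory consumers every few seconds so that, when the
    next OOM kill happens, its syslog timestamp can be matched against a
    recent snapshot. A minimal sketch using only /proc; the output file
    path and interval are assumptions, adjust to taste."""

    import os, time

    SNAPSHOT_FILE = "/var/log/mem-snapshots.log"  # assumed path, pick your own
    INTERVAL_S = 10
    TOP_N = 15

    def rss_by_process():
        """Return a list of (rss_kb, pid, name) read from /proc/<pid>/status."""
        procs = []
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                name, rss_kb = None, 0
                with open("/proc/%s/status" % pid) as f:
                    for line in f:
                        if line.startswith("Name:"):
                            name = line.split(None, 1)[1].strip()
                        elif line.startswith("VmRSS:"):
                            rss_kb = int(line.split()[1])
                if name and rss_kb:
                    procs.append((rss_kb, pid, name))
            except (IOError, OSError):
                pass  # process exited between listdir() and open()
        return sorted(procs, reverse=True)

    if __name__ == "__main__":
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            with open(SNAPSHOT_FILE, "a") as out:
                out.write("--- %s ---\n" % stamp)
                for rss_kb, pid, name in rss_by_process()[:TOP_N]:
                    out.write("%10d kB  pid %-6s %s\n" % (rss_kb, pid, name))
            time.sleep(INTERVAL_S)

The OOM report itself also prints a task table with per-process RSS (in pages), which is presumably where the "Vertica is by far the biggest consumer" observation comes from; the periodic snapshots just add the history leading up to that point.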
