At what level of swap and memory usage should I have alerts triggered?

We have a 10-node cluster, with Nagios monitoring all nodes. By default we trigger alerts when a system uses more than 90% RAM or 20% swap. We have noticed that our nodes start using swap faster than RAM (39% RAM but 25% swap used - odd, I know). Therefore, what would be the recommended alert thresholds for both?
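
For context, a minimal sketch of how such a check might be implemented in Python, assuming a Linux /proc/meminfo and the Nagios exit-code convention (0 = OK, 1 = WARNING); the thresholds shown are the ones from the question, not recommendations:

```python
#!/usr/bin/env python
# Hypothetical Nagios-style check for RAM and swap usage (Linux only).
# Thresholds are the ones from the question above, not recommendations.
import sys

RAM_WARN_PCT = 90.0
SWAP_WARN_PCT = 20.0

def read_meminfo():
    """Parse /proc/meminfo into a dict of values in kB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])
    return info

def pct_used(total_kb, free_kb):
    return 100.0 * (total_kb - free_kb) / total_kb if total_kb else 0.0

def main():
    info = read_meminfo()
    # Count cache and buffers as "free", since the kernel can reclaim them.
    ram = pct_used(info["MemTotal"],
                   info["MemFree"] + info.get("Cached", 0) + info.get("Buffers", 0))
    swap = pct_used(info["SwapTotal"], info["SwapFree"])
    msg = "RAM %.0f%% used, swap %.0f%% used" % (ram, swap)
    if ram > RAM_WARN_PCT or swap > SWAP_WARN_PCT:
        print("WARNING - " + msg)
        sys.exit(1)   # Nagios WARNING
    print("OK - " + msg)
    sys.exit(0)       # Nagios OK

if __name__ == "__main__":
    main()
```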

Comments

  • Hi Christophe, Vertica should not generally be swapping at all. It can potentially use all available system RAM, and Linux will occasionally move small amounts of data out to swap proactively, so I would not set a threshold of 0%. But queries will scale back their memory usage, or queue up and wait for memory to become available, if there is insufficient RAM. However, Vertica assumes that it is the only significant memory-consuming application on the system. If you're seeing a lot of swap usage, is it possible that you have another process that occasionally starts up, eats a lot of RAM, and then exits? Perhaps quickly enough that Nagios doesn't notice it: Nagios typically uses a polling model, and a brief burst between Nagios samples can go unnoticed (a higher-frequency sampler, like the sketch after this thread, can catch such bursts). If you need to run a large application alongside Vertica, you can look into Vertica Resource Pools to define smaller memory-usage limits.
  • We too have noticed swap usage with our Vertica 6.1.0 and 6.1.1 clusters. When the vertica process is started on a node, no swap is allocated; then, over time, swap usage grows until it completely fills the 2GB allocated to swap. There are no other significant processes running on the systems that should be consuming swap space, and restarting the vertica process clears the allocated swap. After a period of time we also notice that ROS containers begin to grow (rather than being kept under control by the Tuple Mover), and this affects performance, in particular the response time of queries against the projection_storage system table. Under what circumstances would Vertica begin utilising swap?
  • Hi Andrew, Having looked into this more since my last post: the simplest answer is that Vertica never explicitly swaps; swapping is up to the Linux kernel. But Vertica is not intended to use more system memory than is available. To that end, two questions:
    - What do your hardware specs look like? In particular, do you meet Vertica's minimum RAM requirement of 2GB per core? (If not, Vertica may swap.)
    - What does the "swappiness" kernel parameter look like on your cluster? (http://en.wikipedia.org/wiki/Swappiness) This parameter can cause Linux to swap data out in order to keep a larger filesystem cache, even when there is enough RAM that there is no actual need to use swap yet. Some Linux distributions set it by default, and in some use cases it can be beneficial. (A quick snippet to inspect these kernel parameters appears after this thread.)
    (The usual third question, and the most common issue in the field, is "are you running anything else on the same machines?" But you've already answered that.)
    If you see Vertica using more than about 95% of system RAM (or the equivalent, if some swap is in use) at any given moment, that really would sound like a Vertica issue. In that case, please file a support case, and if at all possible include details about what operations were running on the cluster at the time.
    Regarding ROS containers: system tables are not stored in ROS containers. They use a completely separate storage format and mechanism from data in other tables, and as such are not subject to the Tuple Mover, etc. In particular, the maximum amount of data stored for each system table (or at least, for each underlying dc_* table) is a setting that you can adjust: https://my.vertica.com/docs/6.1.x/HTML/index.htm#16142.htm If you're finding that the retention policy is more aggressive than you would like, you're welcome to adjust it. (We work aggressively to minimize the system-resource cost of maintaining and updating system tables. Unfortunately, this has made them relatively expensive to query, as we spend far fewer resources optimizing queries against them than against ordinary tables. It's a known issue that we're continuing to work on.)
  • Hi Adam, Thanks for your feedback. To answer your questions: we have one cluster of nodes with 128GB RAM and 32 cores each, and another cluster of nodes with 72GB RAM and 24 cores each. Swappiness is 60; however, /proc/sys/vm/overcommit_memory has been set to zero. Is it safe to set swappiness to 0 as well? Under normal operation, our vertica nodes do not use more than 40% of available system RAM.
  • Hi Andrew, Thanks for the reply. Regarding swappiness, I suspect that's a big part of what's going on. Swappiness is not constrained by overcommit_memory: Linux doesn't count caching extra pages in its buffer pool as overcommitting, because those pages can be freed at any time if it needs to swap program code back in. I'm not qualified to comment on what is and isn't a safe swappiness setting; perhaps others could speak up here. I will say that Vertica isn't written to expect any particular value for swappiness. You might want to check on a Linux forum first regarding other impacts; I would expect the only impact to be performance-related. It's not clear to me that the impact would be entirely positive; it's very workload-dependent. Adam
  • Hi Adam, Thanks, we're going to set the swappiness to 0 on one of our cluster nodes and assess its impact. Cheers, Andrew
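
As noted in the first reply, a polling monitor like Nagios can miss brief memory bursts between samples. Below is a minimal sketch of a higher-frequency sampler that logs the top memory consumer whenever RAM usage crosses a threshold; it reads only standard Linux /proc files, and THRESHOLD_PCT and INTERVAL_SEC are illustrative values, not recommendations:

```python
#!/usr/bin/env python
# Hypothetical high-frequency sampler to catch short memory bursts that a
# polling monitor like Nagios can miss between samples (Linux only).
import glob
import time

THRESHOLD_PCT = 80.0  # illustrative alert level, not a recommendation
INTERVAL_SEC = 5      # much shorter than a typical Nagios polling interval

def ram_used_pct():
    """Current RAM usage from /proc/meminfo, treating cache/buffers as free."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])
    free = info["MemFree"] + info.get("Cached", 0) + info.get("Buffers", 0)
    return 100.0 * (info["MemTotal"] - free) / info["MemTotal"]

def top_process():
    """Return (rss_kb, pid, name) for the largest process by resident memory."""
    best = (0, "?", "?")
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as f:
                fields = dict(line.split(":", 1) for line in f)
            rss = int(fields.get("VmRSS", "0 kB").split()[0])
            if rss > best[0]:
                best = (rss, path.split("/")[2], fields["Name"].strip())
        except (IOError, ValueError, KeyError):
            pass  # process exited while we were scanning
    return best

while True:
    used = ram_used_pct()
    if used > THRESHOLD_PCT:
        rss, pid, name = top_process()
        print("%s RAM %.0f%% used; top process: %s (pid %s, %d kB resident)"
              % (time.strftime("%H:%M:%S"), used, name, pid, rss))
    time.sleep(INTERVAL_SEC)
```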
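
And a small snippet to inspect the kernel parameters discussed in this thread (vm.swappiness and vm.overcommit_memory); it simply reads the corresponding /proc/sys/vm entries, so it can be run on each node:

```python
#!/usr/bin/env python
# Print the kernel parameters discussed in this thread; run on each node.
for param in ("swappiness", "overcommit_memory"):
    with open("/proc/sys/vm/" + param) as f:
        print("vm.%s = %s" % (param, f.read().strip()))
```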
