Tuple Mover falls behind, resulting in constant growth of ROS containers

Andrew_Bacon · June 2013

Since upgrading from 5.0.8 to 6.1.0 and 6.1.1 we are seeing constant growth in ROS containers after a period of uptime following a DB or node restart. I confirm this by querying the projection_storage table for ros_count. Initially, ROS container volumes are maintained, but over time my queries to the projection_storage table get slower and slower and ROS volumes grow. The only resolution appears to be to run manual mergeouts using the tuple mover. This reduces the ROS count and also the query duration against the projection_storage table. However, this is only a temporary measure, as ROS containers continue to grow even after a manual mergeout. We did not witness this issue in v5.0.8. The only permanent solution is to restart Vertica; upon a restart, the tuple mover reduces the ROS count back to normal levels. Are there known issues with the Tuple Mover lagging behind and causing runaway ROS growth? Could this be due to poor performance of queries against the projection_storage table as the number of ROS containers grows?

[Deleted User] · June 2013

Hi Andrew, Hm... So, I see that you're seeing different behavior with regard to ROS containers in 6.x vs 5.x. I guess my question is, what problem are you seeing? If your specific problem is that queries of the projection_storage table are slow, could you describe your workload a bit more? Perhaps we can provide a workaround. (In that case, I'd also be curious why you're querying that table so much :-) ) It is correct that, in some cases, 6.x will have more ROS containers than 5.x, typically for performance reasons. For example, see the "Local Segmentation" feature in the documentation. One question: Is the tuple mover just not keeping up, that is, do you see mergeouts happening at all? Or are there no mergeouts at all? If there are no mergeouts happening at all, something has likely gone wrong with your configuration; please file a support case. If you are seeing mergeouts but the tuple mover is lagging behind in a way that's demonstrably harmful (for example, the ROS-container count is growing to the point where you're hitting or approaching ROS pushback), you could try tuning the tuple mover to be more aggressive: https://my.vertica.com/docs/6.1.x/HTML/index.htm#14361.htm Another thing for you to think about: are you using regular COPY, or COPY DIRECT? (Are you loading to WOS or ROS?) Typically you do want to do small loads into WOS first, so that moveout can help them. That said, in 6.x, we added some query-performance optimizations to the WOS that in turn cause data in the WOS to potentially consume more memory per projection. If you are loading to the WOS and you run out of WOS space (because moveout lagged behind), we automatically switch to COPY DIRECT for that load. (If you want to catch this very quickly, though with perhaps too heavy a hammer, use COPY TRICKLE; in this case Vertica will error rather than switching to COPY DIRECT if the WOS is full.) 
This can happen because your load rate increased, or potentially because you were right at the edge of using all of WOS with 5.x, and the new 6.x data structures push the WOS just over the edge to 'full'. If that happens for a bunch of loads, you can end up with a ton of little ROS containers, which in turn means a lot more work for mergeout. The solution here is actually earlier in the pipeline, with moveout: either make moveout more aggressive, or increase the amount of memory allocated to the WOS. Both are documented at the link above. Adam

Andrew_Bacon · June 2013

Hi Adam, Thanks for your insight. Predominantly the problem is continuous growth in ROS containers, which (if left unchecked) would undoubtedly lead to more than 1024 ROS containers per projection per node. We do see mergeouts occurring, and for a period of time after a DB or node restart the Tuple Mover keeps the volume of ROS containers under control. What first alerted us to the problem was that we regularly (every 3 hours) query the projection_storage table to determine the average ros_count. This is part of our regular monitoring of ROS containers (a previous "replay-delete" issue made us feel we should put such monitoring in place). Over time, the duration of this query increases, as does the volume of ROS containers. I can run manual mergeouts to reduce the volume of ROS containers, but once they complete the number of ROS containers begins to climb continuously again. My attempts to tune the Tuple Mover have been fruitless: I've given the WOS resource pool more memory, increased plannedconcurrency for TM and reduced the MoveOut and MergeOut intervals. None of these has had any positive effect, other than the increased memory size for WOS reducing the number of resource rejections. If anything, tuning the Tuple Mover as suggested by the documentation appears to make the situation worse, with ROS containers being created at an increased rate. The only solution to date is to restart node(s), restart the DB, or reboot the boxes, some of which we've had no choice but to do, because nodes lock up when memory usage increases to 100%. After restart, the ROS volumes begin to decrease without any manual intervention. We now load predominantly to WOS, with rare overspill to ROS; until recently, however, we loaded directly to ROS. This change from ROS to WOS has not made any noticeable difference to the problem of continuous ROS growth. 
I'm currently purging the entire database to determine whether the volume of deleted rows is contributing to the growth in ROS containers. However, we are not performing widespread deletes; the majority of our workload is loading to WOS via COPY.
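
The monitoring described in this post (periodically averaging ros_count from projection_storage and watching for the 1024-container ceiling) might be sketched roughly as follows. This is only an illustration, not the script actually used: the function names, the 0.8 warning ratio, and the sample rows are assumptions, and fetching the rows from the database is deliberately left out.

```python
from statistics import mean

# Ceiling quoted in the thread: more than 1024 ROS containers
# per projection per node is the point to stay away from.
ROS_LIMIT = 1024


def average_ros_count(rows):
    """Average ros_count over rows of (node_name, projection, ros_count)."""
    counts = [ros_count for _, _, ros_count in rows]
    return mean(counts) if counts else 0.0


def projections_near_limit(rows, warn_ratio=0.8):
    """Return rows at or above warn_ratio * ROS_LIMIT, worst first.

    warn_ratio is a hypothetical alerting threshold, not a Vertica setting.
    """
    threshold = ROS_LIMIT * warn_ratio
    return sorted(
        (row for row in rows if row[2] >= threshold),
        key=lambda row: row[2],
        reverse=True,
    )


# Made-up sample rows standing in for a projection_storage result set.
sample = [
    ("node01", "store.sales_p1", 120),
    ("node01", "store.events_p1", 900),
    ("node02", "store.events_p1", 850),
]
print(average_ros_count(sample))
print(projections_near_limit(sample))
```

Recording the average from each run would make the slow upward drift visible well before any single projection gets close to the limit.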

[Deleted User] · June 2013

Hi Andrew, Hm... Thanks for the additional detail. That does seem quite odd; it's not something I'm personally familiar with. I'm particularly concerned that the database's memory usage exceeds 100%. That should not be happening in general. Similarly, I don't know offhand of a reason that mergeout should slow down after the system has been up for a while (in a way that would be fixed by simply restarting), unless there's some issue with resource management. It sounds to me like something is not working properly with your installation. I would encourage you to open a support case. (You're certainly welcome to keep asking here as well; others might have some ideas.) One question, which I'm sure support would also ask: Is the problem fixed by simply restarting the Vertica process? Or do you actually have to reboot the whole machine? Adam

Andrew_Bacon · June 2013

Hi Adam, Thanks. We do have ongoing support cases, this being one of them, but this case has hardly progressed since I opened it back in April, so I wanted to open it up to the community and find out whether anyone else had encountered this situation and found a way to resolve it. To answer your question: yes, when we restart the Vertica process, the memory usage drops, swap usage returns to near 0 and the ROS volumes begin to drop back to their expected levels. It would be useful to hear if anyone else has encountered something similar. Andrew

eli_revach · July 2013

Hi all, I am facing a similar situation during my load process. We load continuously into the WOS using JDBC bulk inserts, committing every 1K records. Right after a Vertica restart the average load rate is great, but after a few hours it becomes slower and slower. Our ROS containers are left behind by mergeout, and only a manual mergeout merges them into a single ROS container. In our case the manual mergeout does merge them into a single ROS container, but it does not fix the load performance; only a restart of Vertica brings it back to normal (for a few hours, until it slows down again). So it looks like I am not the only one reporting this behavior. Please keep me updated on any progress you make with support on this. Thanks for sharing.

Andrew_Bacon · July 2013

Hi Eli, Thanks for your input. I think you're experiencing a different issue. We predominantly use COPY to load data, not insert. How many records are you loading per hour? It sounds to me like you should switch to using COPY. I'm still pursuing this issue with Vertica support. Currently, the best way to reduce ROS containers is for me to run a "select purge()" on the entire database. This has the effect of temporarily reducing the number of ROS containers; however, the volumes start increasing again as soon as the purge is complete. This is my preferred method of keeping ROS containers low; previously I had scripted continuous TM mergeouts of the projections with the highest number of ROS containers. Cheers, Andrew

eli_revach · July 2013

Thanks for the reply. I am using Java batch inserts, which translate internally to a COPY command; this is how it is presented in the Vertica log file. I will try the purge statement you ran and check whether it is useful in my case as well. Please keep me updated if you get any information from support.

Andrew_Bacon · July 2014

Just an update here: the issue has been acknowledged as a bug by Vertica, and they had hoped to release a fix in the next version, "Dragline". Unfortunately the fix for this problem will not be in the Dragline release. Hopefully it will be addressed in the following release.

Mariano · August 2014

Hi,
We are using version 7, and we're having the same problem.
Do you know how to fix it, or whether there is any workaround or patch?
Thanks in advance.

Andrew_Bacon · August 2014

Hi Mariano,

No, there's still no fix or patch. We reverted to our workaround, whereby we've scripted continuous TM mergeouts of the projections with the highest number of ROS containers.

You can script something basic using the following query to list projections and their ROS counts

SELECT projection_schema || '.' || projection_name, ros_count
FROM projection_storage
ORDER BY ros_count DESC
LIMIT 10;

Then get the script to execute a mergeout using this command on the projections with the highest ros counts.

select do_tm_task('MergeOut','<PROJECTION SCHEMA>.<PROJECTION NAME>');

If you have a severe ROS growth problem you may need to execute multiple instances of this script.
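
As a rough illustration of the scripting step described above, the sketch below turns rows returned by the projection_storage query into the corresponding do_tm_task statements. It is not the script actually used: the function name build_mergeout_statements, the top_n parameter and the sample rows are hypothetical, running the query and executing the statements against Vertica are left out, and it assumes the same projection may appear more than once in the result (for example once per node).

```python
def build_mergeout_statements(rows, top_n=10):
    """Build MergeOut statements for the projections with the most ROS containers.

    rows: iterable of (projection_schema, projection_name, ros_count).
    """
    # Keep the highest ros_count seen for each schema-qualified projection,
    # so a projection listed several times yields a single statement.
    worst = {}
    for schema, name, ros_count in rows:
        key = f"{schema}.{name}"
        worst[key] = max(worst.get(key, 0), ros_count)

    # Rank projections by ROS count, highest first, and keep the top_n.
    ranked = sorted(worst.items(), key=lambda item: item[1], reverse=True)[:top_n]

    statements = []
    for projection, _ in ranked:
        # Double any single quote so the name stays inside the SQL literal.
        escaped = projection.replace("'", "''")
        statements.append(f"select do_tm_task('MergeOut','{escaped}');")
    return statements


# Made-up rows standing in for the query result.
example_rows = [
    ("public", "a_p1", 50),
    ("public", "b_p1", 300),
    ("public", "a_p1", 70),
    ("store", "c_p1", 10),
]
for statement in build_mergeout_statements(example_rows, top_n=2):
    print(statement)
```

Generating the statements separately from executing them makes it easy to log or review what a scheduled run is about to merge.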

Hope this helps

Veronica_1 · September 2014

Hello,

One question about this. Could the problem with ROS containers cause a service interruption in some cases, or does it only affect performance?

Thanks in advance.

Andrew_Bacon · September 2014

I've never tested it to find out, but if left unchecked, I believe this would eventually affect your service. Our concern is that you would start to see the "Too many ROS containers" error detailed here

http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Monitoring/Vertica/Eve...

This causes transactions to be rolled back and prevents data loading, so it's best avoided.

I also imagine that performance will degrade as ROS volumes grow. In my experience, executing the additional mergeouts does not appear to hinder performance.
