"ERROR 5065: Too many ROS containers" during ETL load
PROBLEM:
Our ETL process started throwing the following error today:
ERROR 5065: Too many ROS containers exist for the following projections: prod.Dim_UserAttributes_b0 (limit = 11264, ROS files = 561, DV files = 10704, new files = 2) HINT: Please wait for the tuple mover to catch up. Use 'select * from v_monitor.tuple_mover_operations;' to monitor
I kept an eye on the v_monitor.tuple_mover_operations table, but the count doesn't seem to be shrinking. Need to resolve this ASAP as we aren't able to load data into our analytics system.
SOLUTION:
Determined from the error message that there were a lot of delete vectors. Customer ran purge_table to clean them up. I provided possible causes and recommendations on how to monitor and control it if it is a recurring issue vs. a one-time spike. Details below.
The error you are getting indicates that there was a large delete or update.
ERROR 5065: Too many ROS containers exist for the following projections: prod.Dim_UserAttributes_b0 (limit = 11264, ROS files = 561, DV files = 10704, new files = 2) HINT: Please wait for the tuple mover to catch up. Use 'select * from v_monitor.tuple_mover_operations;' to monitor.
The DV files are delete vectors, which are the result of a delete or update (an update is a delete plus an insert). So the limit is being hit because those delete vectors have not been purged and still count towards the overall ROS container count.
Recommended steps (example queries are shown after the list):
1. See if the AHM is close to current by running "select * from system;" in vsql and checking whether the AHM epoch is reasonably close (within a few) to the current epoch. If not, send the output for review. We may need to run make_ahm_now() to force it current, or investigate why the AHM is lagging (it could be data in WOS not getting moved out, or a projection not refreshed).
** Customer checked and the AHM was only 9 epochs behind the current epoch.
2. If the AHM looks OK, use the purge_table function to purge the delete vectors. You can get a before-and-after count from the delete_vectors system table to confirm they were purged. Only records older than the AHM are purged, which is why we check the AHM first.
** Customer ran purge_table and it cleaned up the delete vectors.
3. Look at what other tables have a large delete_vector count and purge them too.
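For reference, here is a minimal vsql sketch of the three steps above. The table name is the customer's (prod.Dim_UserAttributes), and the system table columns (current_epoch, ahm_epoch, deleted_row_count, etc.) are from the release we were working with, so verify them against your version first.

-- Step 1: how far is the AHM behind the current epoch?
select current_epoch, ahm_epoch, last_good_epoch from system;
-- Only if the AHM is badly lagging and you understand why: force it current
-- select make_ahm_now();

-- Step 2: delete-vector count for the affected table, before and after the purge
select count(*) from delete_vectors
 where schema_name = 'prod' and projection_name like 'Dim_UserAttributes%';
select purge_table('prod.Dim_UserAttributes');
select count(*) from delete_vectors
 where schema_name = 'prod' and projection_name like 'Dim_UserAttributes%';

-- Step 3: find other projections carrying a lot of delete vectors
select schema_name, projection_name, count(*) as dv_count,
       sum(deleted_row_count) as deleted_rows
  from delete_vectors
 group by schema_name, projection_name
 order by dv_count desc
 limit 20;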
Causes and suggestions:
Deletes and updates create delete vectors (records marked for deletion). Mergeouts will purge them, but mergeouts run at whatever interval the mergeoutinterval parameter is set to (the default is every 10 minutes), and not every projection is merged on each pass. So if the deletes/updates happen faster than mergeouts occur on the projections in question, the delete vectors build up.
Also, each table's mergeout eligibility is somewhat controlled by its number of columns and the max ROS containers per stratum. Basically it's like a pyramid: you need a certain number of containers at the base before the next layer can start, so a projection has to reach the limit at its current layer before mergeout picks it up as needing processing. You can see the layers and limits in the strata and strata_structures system tables.
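If you want to see where a given projection sits in that pyramid, you can query those strata tables directly. A minimal sketch (the exact column set varies by Vertica version, so check the table definitions before filtering on specific columns):

-- per-stratum layout and limits for each projection
select * from strata_structures;
-- current container placement per stratum
select * from strata;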
If you do a lot of deletes and/or updates, and your automation allows, you could build a purge_table call into your process to clean up after the data is committed.
If mergeouts keep up under normal conditions but you occasionally have abnormal delete/update activity, then likewise you could build purge_table into the process.
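A minimal sketch of that, assuming your ETL can run one extra statement after its commit (the table name is the customer's; substitute your own):

-- run as the last step of the load, once the batch is committed
select purge_table('prod.Dim_UserAttributes');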
You can also increase the mergeout frequency by decreasing the mergeoutinterval configuration parameter, but make sure it doesn't run so often that it conflicts with your normal workload. This is OK for an evenly distributed increase in deletes/updates, but it is not a good solution for irregular spikes.
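A sketch of checking and lowering it, assuming the parameter is named MergeOutInterval and is expressed in seconds (600 = the 10-minute default); verify both against your version before changing anything:

-- current setting
select parameter_name, current_value, default_value
  from configuration_parameters
 where parameter_name ilike 'MergeOutInterval';
-- drop it to 8 minutes
select set_config_parameter('MergeOutInterval', 480);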
So in summary: work purge_table into the process if possible, monitor the delete_vectors table for a large number of rows for any given projection, and run purge_table as necessary to keep ahead of it. As a last resort, increase the mergeout frequency by lowering mergeoutinterval (maybe drop it from 10 to 9 or 8 minutes) if the deletes/updates are happening consistently rather than as one-time spikes.
Comments
Lots of delete vectors can also slow down your queries, because Vertica reads the deleted data too while scanning; you don't want your queries to waste precious time scanning data that has already been deleted from the database.
The delete vectors also occupy extra space in the database.
So from both a performance and a database footprint perspective, it's always good to have regular purge cycles, depending on the percentage of deleted rows.
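One way to get a rough percentage of deleted rows per projection is to compare delete_vectors against projection_storage. This is only a sketch; the column names (row_count, deleted_row_count, projection_schema, schema_name) may differ by version, so verify them in yours:

-- approximate % of rows carried as delete vectors, per projection
select ps.projection_schema, ps.projection_name,
       ps.total_rows,
       coalesce(dv.deleted_rows, 0) as deleted_rows,
       round(100.0 * coalesce(dv.deleted_rows, 0) / nullif(ps.total_rows, 0), 2) as pct_deleted
  from (select projection_schema, projection_name, sum(row_count) as total_rows
          from projection_storage
         group by 1, 2) ps
  left join (select schema_name, projection_name, sum(deleted_row_count) as deleted_rows
               from delete_vectors
              group by 1, 2) dv
    on dv.schema_name = ps.projection_schema
   and dv.projection_name = ps.projection_name
 order by pct_deleted desc;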
Hope this helps