Stale checkpoint and too many ROS
Hey guys!
I'm pretty new to Vertica, and for the last two weeks my new job has consisted of maintaining the Vertica database.
We run in "standalone mode" (only 1 node).
The database is heavily written: it is loaded/updated/deleted every 5-15 minutes across multiple schemas.
It is heavily read during business hours.
Our Vertica server has 32 GB RAM and 2x Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz with hyperthreading.
It's running Vertica Analytic Database v6.1.2-0.
Issue with ROS:
We frequently hit this error during heavy workload:
Attempted to Create Too Many ROS Containers.
As a manual action we run: SELECT DO_TM_TASK('mergeout', 't1');
But the issue happened again the day after.
As a workaround, we switched to the COPY LOCAL command for our bulk loads.
But the issue happened again the day after.
As a workaround, following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced MergeOutInterval and MoveOutInterval to 300 and 150 respectively.
We increased the Tuple Mover threads from 3 to 5.
Memory size: before 200 MB, after 400 MB.
But the issue happened again the day after.
As a workaround, still following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced MergeOutInterval and MoveOutInterval further, to 150 and 70 respectively.
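For reference, here is roughly how we applied those changes (a sketch only; I'm taking the parameter names and the TM resource-pool settings from the documentation for our version, so treat them as assumptions):

  -- lower the Tuple Mover intervals (values in seconds)
  SELECT SET_CONFIG_PARAMETER('MergeOutInterval', 150);
  SELECT SET_CONFIG_PARAMETER('MoveOutInterval', 70);

  -- give the TM resource pool more threads and memory
  ALTER RESOURCE POOL tm MAXCONCURRENCY 5 MEMORYSIZE '400M';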
We are planning to double the RAM in this server.
Is there a real fix to avoid this issue?
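In the meantime, to see which projections accumulate the most containers, I plan to run something like this (a sketch; I'm assuming v_monitor.storage_containers exposes storage_type and projection_name in 6.1):

  -- count ROS containers per projection to find the worst offenders
  SELECT node_name, schema_name, projection_name, COUNT(*) AS ros_container_count
  FROM v_monitor.storage_containers
  WHERE storage_type = 'ROS'
  GROUP BY node_name, schema_name, projection_name
  ORDER BY ros_container_count DESC
  LIMIT 20;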
Issue with Stale Checkpoint:
While setting up monitoring on Vertica, I ran: SELECT * FROM v_monitor.monitoring_events LIMIT 100;
I found this kind of error:
0 Warning 2014-11-05 09:26:45 2014-11-05 09:36:45 Stale Checkpoint Node v_akabi_node0001 has data 1032429 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694778)
0 Warning 2014-11-05 09:21:45 2014-11-05 09:31:45 Stale Checkpoint Node v_akabi_node0001 has data 1032129 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694757)
All I can find about this issue is that we have to monitor it.
I tried a global moveout to flush every projection out of the WOS, but it didn't fix the issue.
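For the record, what I tried looked roughly like this (the projection_storage column names are assumptions taken from the documentation):

  -- force a moveout of all projections (no second argument = everything)
  SELECT DO_TM_TASK('moveout');

  -- check what is still sitting in the WOS afterwards
  SELECT projection_name, wos_row_count, wos_used_bytes
  FROM v_monitor.projection_storage
  WHERE wos_row_count > 0
  ORDER BY wos_used_bytes DESC;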
But what should we do when this happens?
------
Update:
I did a check and found:
select * from system;
current_epoch ahm_epoch last_good_epoch
697602 668639 668639
It may be related to this.
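The gap is easy to see if you compute it directly (same data as above, just as a sanity check):

  -- how far the AHM and last good epoch lag behind the current epoch
  SELECT current_epoch, ahm_epoch, last_good_epoch,
         current_epoch - ahm_epoch AS ahm_lag_epochs
  FROM v_monitor.system;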
So I've just executed:
select make_ahm_now(); [It takes a lot of time! I will see the result tomorrow.]
And I will purge all tables.
I hope the data remaining in the WOS will be moved out by the previous command (and that I will avoid the stale checkpoint issue).
And I hope that the purge will help me get rid of "Attempted to Create Too Many ROS Containers."
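Concretely, the plan is something like this (PURGE_TABLE is shown as the per-table alternative; the table name is just a placeholder):

  -- advance the Ancient History Mark to the current epoch
  SELECT MAKE_AHM_NOW();

  -- then purge deleted rows, either everything at once...
  SELECT PURGE();

  -- ...or table by table
  SELECT PURGE_TABLE('my_schema.my_table');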
---------------------
Any help with this would be appreciated.
Thanks in advance,
Regards,
Philippe
Comments
Run select * from vs_tuple_mover_operations; to monitor what operations are running.
Reducing the intervals may not be effective if the TM is already busy doing moveout/mergeout.
Using the query you showed me:
I can see that there are 2 mergeouts that have been in progress for 10 hours.
select * from vs_tuple_mover_operations;
Date now: 2014-11-06 18:07
2 mergeout operations, started at:
2014-11-06 07:16:13
2014-11-06 07:16:14
projection_name | earliest_container_start_epoch | latest_container_start_epoch | ros_count | plan_type
tbl_tableau_source_cam_full_query_super | 659273 | 691427 | 32 | 5432787664 | Replay Delete
tbl_tableau_source_cam_full_query_super | 672504 | 696555 | 32 | 7979841327 | Replay Delete
(plus 1 other running for the last 2 hours)
Why are those mergeouts taking so long?
Get the session ID from the sessions table for these transactions and close that session.
Then do:
SELECT DO_TM_TASK('mergeout', '<projection-name>');
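Something like this, I assume (the session ID below is a placeholder to replace with the real one found in the sessions table):

  -- find the session that owns the long-running transaction
  SELECT session_id, transaction_id, current_statement
  FROM v_monitor.sessions;

  -- close it
  SELECT CLOSE_SESSION('<session-id>');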
Finally, when I came back to work this morning, I checked the TM again (select * from vs_tuple_mover_operations;).
Those mergeouts had finished, but a long moveout was still running because my select make_ahm_now(); was still in progress.
This morning I found the "Too Many ROS Containers" error again, and the stale checkpoint is still there too.
Regards,
And do you think it could cause a stale checkpoint?
Regards,
Philippe