Stale checkpoint and too many ROS
Hey guys!
I'm pretty new to Vertica, and for the last two weeks my new job has consisted of maintaining the Vertica database.
We run in "standalone mode" (only 1 node).
The database is heavily written: it is loaded/updated/deleted every 5-15 minutes across multiple schemas.
It is heavily read during business hours.
Our Vertica server has 32 GB RAM and 2x Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz with hyperthreading.
It's running Vertica Analytic Database v6.1.2-0.
Issue with ROS:
We frequently hit this error during heavy workload:
Attempted to Create Too Many ROS Containers.
As a manual action we run: SELECT DO_TM_TASK('mergeout', 't1');
But the issue happened again the day after.
As a workaround, we switched to the COPY LOCAL command for our bulk loads.
But the issue happened again the day after.
As a workaround, following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced MergeOutInterval and MoveOutInterval to 300 and 150 respectively.
We increased the Tuple Mover threads from 3 to 5.
Memory size: before 200 MB, after 400 MB.
But the issue happened again the day after.
As a workaround, still following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced MergeOutInterval and MoveOutInterval further, to 150 and 70 respectively.
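For reference, here is roughly how we applied those changes (a sketch only; I'm taking the parameter names and the TM resource-pool settings from the documentation for our version, so treat them as assumptions):

  -- lower the Tuple Mover intervals (values in seconds)
  SELECT SET_CONFIG_PARAMETER('MergeOutInterval', 150);
  SELECT SET_CONFIG_PARAMETER('MoveOutInterval', 70);

  -- give the TM resource pool more threads and memory
  ALTER RESOURCE POOL tm MAXCONCURRENCY 5 MEMORYSIZE '400M';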
We are planning to double the RAM in this server.
Is there a real fix to avoid this issue?
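In the meantime, to see which projections accumulate the most containers, I plan to run something like this (a sketch; I'm assuming v_monitor.storage_containers exposes storage_type and projection_name in 6.1):

  -- count ROS containers per projection to find the worst offenders
  SELECT node_name, schema_name, projection_name, COUNT(*) AS ros_container_count
  FROM v_monitor.storage_containers
  WHERE storage_type = 'ROS'
  GROUP BY node_name, schema_name, projection_name
  ORDER BY ros_container_count DESC
  LIMIT 20;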
Issue with Stale Checkpoint:
While setting up monitoring on Vertica, I ran: SELECT * FROM v_monitor.monitoring_events LIMIT 100;
I found this kind of error:
0 Warning 2014-11-05 09:26:45 2014-11-05 09:36:45 Stale Checkpoint Node v_akabi_node0001 has data 1032429 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694778)
0 Warning 2014-11-05 09:21:45 2014-11-05 09:31:45 Stale Checkpoint Node v_akabi_node0001 has data 1032129 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694757)
All I can find about this issue is that we have to monitor it.
I tried a global moveout to flush every projection out of the WOS, but it didn't fix the issue.
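For the record, what I tried looked roughly like this (the projection_storage column names are assumptions taken from the documentation):

  -- force a moveout of all projections (no second argument = everything)
  SELECT DO_TM_TASK('moveout');

  -- check what is still sitting in the WOS afterwards
  SELECT projection_name, wos_row_count, wos_used_bytes
  FROM v_monitor.projection_storage
  WHERE wos_row_count > 0
  ORDER BY wos_used_bytes DESC;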
But what should we do when this happens?
------
Update:
I did a check and found:
select * from system;
current_epoch ahm_epoch last_good_epoch
697602 668639 668639
It may be related to this.
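The gap is easy to see if you compute it directly (same data as above, just as a sanity check):

  -- how far the AHM and last good epoch lag behind the current epoch
  SELECT current_epoch, ahm_epoch, last_good_epoch,
         current_epoch - ahm_epoch AS ahm_lag_epochs
  FROM v_monitor.system;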
So I've just executed:
select make_ahm_now(); [It takes a lot of time! I will see the result tomorrow.]
And I will purge all tables.
I hope the data remaining in the WOS will be moved out by the previous command (and that I will avoid the stale checkpoint issue).
And I hope that the purge will help me get rid of "Attempted to Create Too Many ROS Containers."
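Concretely, the plan is something like this (PURGE_TABLE is shown as the per-table alternative; the table name is just a placeholder):

  -- advance the Ancient History Mark to the current epoch
  SELECT MAKE_AHM_NOW();

  -- then purge deleted rows, either everything at once...
  SELECT PURGE();

  -- ...or table by table
  SELECT PURGE_TABLE('my_schema.my_table');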
---------------------
Any help with this would be appreciated.
Thanks in advance,
Regards,
Philippe
Comments
Run select * from vs_tuple_mover_operations; to monitor what operations are running.
Reducing the intervals may not be effective if the TM is already busy doing moveout/mergeout.
Using the query you showed me:
I can see that there are 2 mergeouts that have been in progress for 10 hours.
select * from vs_tuple_mover_operations;
Date now: 2014-11-06 18:07
2 mergeout operations, started at:
2014-11-06 07:16:13
2014-11-06 07:16:14
projection_name | earliest_container_start_epoch | latest_container_start_epoch | ros_count | plan_type
tbl_tableau_source_cam_full_query_super | 659273 | 691427 | 32 | 5432787664 | Replay Delete
tbl_tableau_source_cam_full_query_super | 672504 | 696555 | 32 | 7979841327 | Replay Delete
(plus 1 other running for the last 2 hours)
Why are those mergeouts taking so long?
Get the session ID from the sessions table for these transactions and close that session.
Then do:
SELECT DO_TM_TASK('mergeout', '<projection-name>');
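Something like this, I assume (the session ID below is a placeholder to replace with the real one found in the sessions table):

  -- find the session that owns the long-running transaction
  SELECT session_id, transaction_id, current_statement
  FROM v_monitor.sessions;

  -- close it
  SELECT CLOSE_SESSION('<session-id>');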
Finally, when I came back to work this morning, I checked the TM again (select * from vs_tuple_mover_operations;).
Those mergeouts had finished, but a long moveout was still running because my select make_ahm_now(); was still in progress.
This morning I found the "Too Many ROS Containers" error again, and the stale checkpoint is still there too.
Regards,
And do you think it could cause a stale checkpoint?
Regards,
Philippe