
Stale checkpoint and too many ROS

Hey guys!

I'm pretty new to Vertica, and for the last two weeks my new job has consisted of maintaining the Vertica database.

We are in "standalone mode" (only 1 node).

The database is heavily loaded/updated/deleted every 5-15 minutes on multiple schemas.
It is heavily read during business hours.

Our Vertica server has 32 GB of RAM and 2 Intel(R) Xeon(R) E5-2430 0 @ 2.20GHz CPUs with hyper-threading.
It's running Vertica Analytic Database v6.1.2-0

Issue with ROS:

We frequently hit this issue during heavy workloads:
Attempted to Create Too Many ROS Containers.

As a manual action we run: SELECT DO_TM_TASK('mergeout', 't1');
But the issue happened again the day after.
As a workaround, we switched to the COPY LOCAL command for our bulk loads.
But the issue happened again the day after.
As a workaround, following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced the mergeout interval and moveout interval to 300 and 150 respectively.
We increased the TM threads from 3 to 5.
Memory size: before 200MB, after 400MB.
But the issue happened again the day after.
As a workaround, again following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced the mergeout interval and moveout interval to 150 and 70 respectively.
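
For reference, a sketch of that kind of interval change, assuming the standard SET_CONFIG_PARAMETER function and the MergeOutInterval / MoveOutInterval parameter names (values are in seconds):

SELECT SET_CONFIG_PARAMETER('MergeOutInterval', 150);
SELECT SET_CONFIG_PARAMETER('MoveOutInterval', 70);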

We are planning to double the RAM on this server.

Is there a real workaround to avoid this issue?
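
A minimal diagnostic sketch, assuming the standard v_monitor.storage_containers system table, to see which projections accumulate the most ROS containers (the LIMIT is arbitrary):

SELECT node_name, projection_name, COUNT(*) AS ros_container_count
FROM v_monitor.storage_containers
GROUP BY node_name, projection_name
ORDER BY ros_container_count DESC
LIMIT 20;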

Issue with Stale Checkpoint:

As I planned to set up monitoring on Vertica, I ran SELECT * FROM v_monitor.monitoring_events LIMIT 100;
and found this kind of error:

0 Warning 2014-11-05 09:26:45 2014-11-05 09:36:45 Stale Checkpoint Node v_akabi_node0001 has data 1032429 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694778)
0 Warning 2014-11-05 09:21:45 2014-11-05 09:31:45 Stale Checkpoint Node v_akabi_node0001 has data 1032129 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694757)

All I can find about this issue is that we have to monitor it.

I tried a global moveout in order to flush every projection out of the WOS, but it didn't fix the issue.

But what should we do when this happens?
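
For reference, a sketch of how to see what is still sitting in the WOS and to trigger a global moveout (assuming the standard v_monitor.projection_storage system table and the DO_TM_TASK function):

SELECT node_name, projection_name, wos_row_count, wos_used_bytes
FROM v_monitor.projection_storage
WHERE wos_row_count > 0
ORDER BY wos_used_bytes DESC;

SELECT DO_TM_TASK('moveout');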

------
Update:

I did a check and found:

select * from system;

current_epoch   ahm_epoch   last_good_epoch
697602          668639      668639


It may be related to this.

So I've just executed:
select make_ahm_now(); [It takes a lot of time! Will see the result tomorrow.]

And then I will purge all tables.

I hope the data remaining in the WOS will be moved out by the previous command (and that I will avoid the stale checkpoint issue).

And I hope the purge will help me get rid of "Attempted to Create Too Many ROS Containers."
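
For reference, a sketch of the purge step, assuming the standard MAKE_AHM_NOW, PURGE_TABLE, and PURGE functions (the table name is only an example):

SELECT MAKE_AHM_NOW();
SELECT PURGE_TABLE('my_schema.my_table');
-- or, more expensive, purge every table at once:
SELECT PURGE();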

---------------------
Any help with this will be appreciated.


thanks in advance,

Regards,

Philippe





Comments

  • Prasanta_Pal - Employee
    Use COPY /*+direct*/ or INSERT /*+direct*/ for bulk loads to avoid moveout; it should write directly to ROS.

    Run select * from vs_tuple_mover_operations; to monitor what operations are running.

    Reducing the interval may not be effective if the TM is busy doing moveout/mergeout.
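
    For example, a minimal COPY sketch using the DIRECT keyword (the schema, table, and file path here are hypothetical):

    COPY my_schema.my_table FROM LOCAL '/tmp/load.csv' DELIMITER ',' DIRECT;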

  • Sorry, I wrote COPY LOCAL but forgot to say that we are using COPY /*+direct*/.

    Using the query you showed me:

    I can see that there are 2 mergeouts that have been in progress for 10 hours.


    select * from vs_tuple_mover_operations;

    Current date: 2014-11-06 18:07

    2 Mergeout operations

    2014-11-06 07:16:13
    2014-11-06 07:16:14

    Projection_name - earliest_container_start_epoch - latest_container_start_epoch - ros_count -  plan_type

    tbl_tableau_source_cam_full_query_super    659273    691427    32    5432787664    Replay Delete

    tbl_tableau_source_cam_full_query_super    672504    696555    32    7979841327    Replay Delete
    3 others have been running for 3 hours
    1 other has been running for 2 hours

    Why are those mergeouts taking so long?


  • Prasanta_Pal - Employee
    It looks like the mergeout (Replay Delete) for the projection "tbl_tableau_source_cam_full_query_super" got stuck.

    Get the session ID from the sessions table for this transaction and close that session.
    And then do
    SELECT DO_TM_TASK('mergeout', '<projection-name>');
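
    A minimal sketch of that sequence, assuming the standard v_monitor.sessions table and CLOSE_SESSION function (the session id shown is hypothetical):

    SELECT session_id, transaction_id, current_statement FROM v_monitor.sessions;
    SELECT CLOSE_SESSION('v_akabi_node0001-1234:0xabc');
    SELECT DO_TM_TASK('mergeout', 'tbl_tableau_source_cam_full_query_super');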

  • Hi,

    Finally, when I came back to work this morning and looked at the TM (select * from vs_tuple_mover_operations;),

    those mergeouts were finished, but a long moveout was still remaining because my select make_ahm_now(); was still running.

    This morning I found the "too many ROS containers" error again, and the stale checkpoint is still there too.

    Regards,
  • And I can see that the moveout process has been stuck on the same table since the moment I executed select make_ahm_now();
  • Prasanta_Pal - Employee
    If there are tuple mover operations that are Replay Delete, meaning you are doing frequent deletes/updates, these are slow. See the Vertica documentation on optimizing delete performance.
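
    A minimal sketch for gauging that, assuming the standard v_monitor.delete_vectors table and the EVALUATE_DELETE_PERFORMANCE function:

    SELECT projection_name, COUNT(*) AS delete_vectors, SUM(deleted_row_count) AS deleted_rows
    FROM v_monitor.delete_vectors
    GROUP BY projection_name
    ORDER BY deleted_rows DESC;

    SELECT EVALUATE_DELETE_PERFORMANCE('tbl_tableau_source_cam_full_query_super');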
  • OK, I'll check with the developer to see which column he uses for his updates/deletes.

    And do you think it could cause a stale checkpoint?

    Regards,

    Philippe
