Stale checkpoint and too many ROS

Hey guys !

I pretty new to vertica and since 2 weeks, my new job consist in maintaining the vertica database.

Here we are in "stand alone mode" (only 1 node).

The vertica has heavily work - loaded/updated/deleted every 5-15 minutes on multiple schemas.
The vertica is heavily read in buisness hours.

Our vertica server has 32 Go RAM, 2*CPU Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz + hyperthread.
It's running Vertica Analytic Database v6.1.2-0

Issue with ROS:

We have frequently this issue happening during heavy workload:
Attempted to Create Too Many ROS Containers.

As manual action we do : SELECT DO_TM_TASK('mergeout', 't1');
But the issue happened the day after.
As workaround : we switched to COPY LOCAL command for our bulk loads.
But the issue happen the day after.
As workaround : regarding to https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced the mergeoutinterval + moveoutinterval to respectively 300 and 150.
We added TM thread from 3 to 5.
Memory Size - before : 200MB , after 400MB
But the issue happened the day after.
As workaround : regarding to https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced the mergeoutinterval + moveoutinterval to respectively 150 and 70.

We are planing to add the double of RAM to this server.

Is there a real workaround to avoid this issue ?

Issue with Stale Checkpoint:

As I planned to do monitoring on vertica, I made SELECT * FROM v_monitor.monitoring_events LIMIT 100;
I found this kind of error :

0 Warning 2014-11-05 09:26:45 2014-11-05 09:36:45 Stale Checkpoint Node v_akabi_node0001 has data 1032429 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694778)
0 Warning 2014-11-05 09:21:45 2014-11-05 09:31:45 Stale Checkpoint Node v_akabi_node0001 has data 1032129 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694757)

All I can find about this issue is that we have to monitor it.

I tried to do a MoveOut global in order to get rid of every projections in WOS, but it didn't fixed the issue.

But what should we do if this happen ?

------
Update :

I made a check and found :

select * from system;

current_epoch   ahm_epoch    last_good_epoch
697602              668639           668639


It may be related to this.

So I've just executed :
select make_ahm_now(); [ It takes a lot of time !!!!!  will se the result tomorow]

And will purge all tables.

Hope the data remaining in WOS will be "movedout" with previous command (And I will avoid stale issue)

And Hope that the purge_action will help me to  get rid to "Attempted to Create Too Many ROS Containers."

---------------------
Any help about this will be apreciable.


thanks in advance,

Regards,

Philippe





Comments

  • Use COPY /*+direct*/ or INSERT /*+direct*/ for bulk load to avoid moveout, it should write to ROS directly for bulk load.

    select * from vs_tuple_mover_operations; to monitor what operations are running.

    reducing the interval may not be effective if  TM  is busy doing moveout/mergeout.

  • Sorry, i Wrote COPY Local, but forgot to say that we are using COPY /* + DIRECT

    As you show me this query :

    I can see that there is 2 mergeout in progress since 10Hours


    select * from vs_tuple_mover_operations;

    Date now : 2014-11-06 18:07

    2 Mergeout operations

    2014-11-06 07:16:13
    2014-11-06 07:16:14

    Projection_name - earliest_container_start_epoch - latest_container_start_epoch - ros_count -  plan_type

    tbl_tableau_source_cam_full_query_super    659273    691427    32    5432787664    Replay Delete

    tbl_tableau_source_cam_full_query_super    672504    696555    32    7979841327    Replay Delete
    3 other since 3 hours  
    1 others since 2 hours

    Why does thoses mergeout are so long ?


  • It looks like mergeout ( Replay delete) for the projection "tbl_tableau_source_cam_full_query_super" got stuck.

    get the session id from the the sessions table for this transactions and close that session.
    And then do
    SELECT DO_TM_TASK('mergeout', '<projection-name>');

  • Hi,

    Finally, when I came back to work this morning, to look at TM (select * from vs_tuple_mover_operations;)

    Thoses mergeout was finished, but still remaining a long moveout due to my select make_ahm_now(); still in running process.

    This morning I found again the error : too many rows, and the stale checkpoint is still remaining too.

    Regards,
  • And I can see that the moveout process is still stuck on the same table since the moment I executed select make_ahm_now();
  • If there are tuple mover operations which are Replay Delete, meaning you are doing frequent deletes/updates, these are slow. Need to see the vertica documentation for optimizing delete performance.
  • Ok, I'll check with the developper to see with wich column he checks for his update/deletes.

    And you think it could issue a stale checkpoint ?

    Regards,

    Philippe

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file