
Vertica delete issues

Hello,  We are noticing issues in our cluster: we are retaining deleted records.  While reviewing our logs, we see the Tuple Mover attempt a mergeout, but then roll back the transaction shortly after.

Our settings are as follows:

HistoryRetentionEpochs is set to -1
HistoryRetentionTime is set to 0
MergeOutInterval is set to 600
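
For anyone checking along, a minimal sketch of how to verify these values (this assumes the CONFIGURATION_PARAMETERS system table and the SET_CONFIG_PARAMETER meta-function are the standard way to inspect and change them; adjust for your version):

select parameter_name, current_value, default_value
from configuration_parameters
where parameter_name in ('HistoryRetentionEpochs', 'HistoryRetentionTime', 'MergeOutInterval');

-- example of changing one of them; the value shown is just the one we already use
select set_config_parameter('MergeOutInterval', 600);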

---

The roll back log entries we are seeing are:
2014-06-04 16:53:00.002 TM Mergeout(00):0x2b50500458a0-b00000000238eb [Txn] <INFO> Begin Txn: b00000000238eb 'Mergeout: Tuple Mover'
2014-06-04 16:53:00.310 TM Mergeout(00):0x2b50500458a0-b00000000238eb [TM] <INFO> Tuple Mover: nothing to merge out
2014-06-04 16:53:00.332 TM Mergeout(00):0x2b50500458a0-b00000000238eb [TM] <INFO> Tuple Mover: no DV to merge out
2014-06-04 16:53:00.332 TM Mergeout(00):0x2b50500458a0-b00000000238eb [Txn] <INFO> Rollback Txn: b00000000238eb 'Mergeout: Tuple Mover'
2014-06-04 16:53:00.346 TM Mergeout(00):0x2b50500458a0-b00000000238ec [Txn] <INFO> Begin Txn: b00000000238ec 'collectMoveStorageJobs'
2014-06-04 16:53:00.367 TM Mergeout(00):0x2b50500458a0-b00000000238ec [Txn] <INFO> Rollback Txn: b00000000238ec 'collectMoveStorageJobs'
2014-06-04 16:53:00.368 TM Mergeout(00):0x2b50500458a0 [Util] <INFO> Task 'TM Mergeout(00)' enabled

This looks as though there are no delete vectors to be merged out; however, if we query delete vectors, we see quite a few (a sketch of that check is below).  Our disk usage is also climbing quickly, so we are concerned that we have not tuned things correctly.
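
A minimal sketch of the delete-vector check (columns per the DELETE_VECTORS system table; adjust as needed):

select node_name, schema_name, projection_name, storage_type, deleted_row_count, used_bytes
from delete_vectors
order by deleted_row_count desc;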


Another observation is that our last known good epoch is not moving forward, so the current epoch and the AHM are out of sync.


Thanks for any help or insight anyone can provide.
Jess
 

Comments

  • Options
    Some more details:
    select * from system;

     current_epoch | ahm_epoch | last_good_epoch | refresh_epoch | designed_fault_tolerance | node_count | node_down_count | current_fault_tolerance | catalog_revision_number | wos_used_bytes | wos_row_count | ros_used_bytes | ros_row_count | total_used_bytes | total_row_count 
    ---------------+-----------+-----------------+---------------+--------------------------+------------+-----------------+-------------------------+-------------------------+----------------+---------------+----------------+---------------+------------------+-----------------
            156460 |         0 |               0 |            -1 |                        1 |          5 |               0 |                       1 |                  163559 |     2878865408 |       8348879 |  1513703369554 |   43359567485 |    1516582234962 |     43367916364
    (1 row)


    We think we may be facing a similar issue to https://community.vertica.com/vertica/topics/get_last_good_epoch_not_advancing_so_cant_drop_old_proj..., but there do not seem to be any notes on how to force the moveout to occur.  As mentioned, we see attempts, but we are also seeing rollbacks.


    The specific text from the previous link is

    " On move out of the old data from WOS, LGE advanced. ".  How was this accomplished?  We are concerned as we are in the process of reloading our warehouse and are not seeing the LGE advance, and so if we were to run into troubles our understanding is we'd have to start from the beginning again.  


    Thanks for any advice.


  • Options
    Hi!
    Tuple Mover: nothing to merge out
    Tuple Mover: no DV to merge out
    What do you want to commit?


    On fresh db:
    $ grep -c Rollback vertica.log  
    2384
    Should I worry about this on my laptop (single node cluster)? :)
    Rollback Txn: a0000000000805 'rebalance_cluster(background)'
    daniel=> select * from system;
    -[ RECORD 1 ]------------+----
    current_epoch            | 11
    ahm_epoch                | 10
    last_good_epoch          | 10
    refresh_epoch            | -1
    designed_fault_tolerance | 0
    node_count               | 1
    node_down_count          | 0
    current_fault_tolerance  | 0
    catalog_revision_number  | 102
    wos_used_bytes           | 0
    wos_row_count            | 0
    ros_used_bytes           | 188
    ros_row_count            | 6
    total_used_bytes         | 188
    total_row_count          | 6
  • Options
    Hi Daniel, Thanks for the reply!

    While I have seen Rollbacks historically, this time is puzzling because it is the Mergeout operation, and we do not see the AHM moving forward or the LGE being set when these transactions complete (or roll back, in this case).  In the case you show, it seems that your LGE and AHM are moving forward.  Do you have thoughts on the scenarios in which the LGE and AHM would not progress?  Is there something more I can check?  While rollbacks are evident, we are wondering if there is a correlation, as we are seeing these messages:

    2014-06-05 04:37:37.626 DistCall Dispatch:0x2b2d6c635350-a0000000011832 [Txn] <INFO> Rollback Txn: a0000000011832 'Manage Epochs: Advance AHM'

    Are we chasing a red herring with the Tuple Mover?  Anything else I can check in the logs to figure out why LGE / AHM are not progressing?


    Thanks,
    Jess


  • Options
    >> we do not see the AHM moving forward, or the LGE being set when these transactions complete (or rollback in this case).
    The epoch advances when a commit occurs and does not advance on rollback. I'm sure it's a "feature", because Vertica can do "historical queries" and the retention time defines how much history to save (in epochs or in seconds). You have configured it for "manual" retention, i.e. all data is saved and delete vectors are not purged by Vertica, only by manual purging. That's why the AHM doesn't advance.
    From the HistoryRetentionTime documentation: determines how long deleted data is saved (in seconds) as a historical reference. The default is 0, which means that HP Vertica saves historical data only when nodes are down. Once the specified time has passed since the delete, the data is eligible to be purged. Use the -1 setting if you prefer to use HistoryRetentionEpochs for determining which deleted data can be purged.
    AHM

    An abbreviation for Ancient History Mark, AHM is the epoch prior to which historical data can be purged from physical storage.
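
    A quick way to see where the epochs stand side by side (a minimal sketch; GET_CURRENT_EPOCH, GET_LAST_GOOD_EPOCH, and GET_AHM_EPOCH are the standard epoch meta-functions as far as I know):

    select get_current_epoch()   as current_epoch,
           get_last_good_epoch() as lge,
           get_ahm_epoch()       as ahm;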

    >> Anything else I can check in the logs to figure out why LGE / AHM are not progressing?

    Tables:
    • dc_tuple_mover_events
    • vs_tuple_mover_operations
    • tuple_mover_operations
    • resource_rejection_details
    Functions:
        select print_next_mergeout_job();

    select do_tm_task('moveout');
    select do_tm_task('mergeout');
    select make_ahm_now();

    Stale Check Points - http://vertica-forums.com/viewtopic.php?f=7&t=1710#p5648
    (Sharon?)

    K-Safety - Do all tables satisfy K-safety?

  • Options
    Hey Daniel,
      This is perfect!  I think this is going to resolve our issue.  I'm going to do some maintenance work tonight, run do_tm_task('mergeout') during that window, and then run a full backup.  I'll post back if this issue is resolved.

    Thanks again Daniel!
    Jess
  • Options
    A few common reasons why the AHM wouldn't advance:

    1. Unrefreshed projections:  select * from projections where not is_up_to_date;

    2. Something is interfering with a moveout - it can't get a lock, for example.  This might be because of a long-running replay delete tuple mover phase - see tuple_mover_operations where is_executing and/or the locks table (sketched below).

    3. If you have stale checkpoints (active_events table) and don't know which projection is causing them, check the storage_containers table for storage_type 'WOS' and see what belongs to an old epoch.
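
    A rough sketch of those checks (column names are assumptions based on the standard TUPLE_MOVER_OPERATIONS, LOCKS, and STORAGE_CONTAINERS system tables, so adjust to taste):

    -- long-running tuple mover operations
    select node_name, operation_name, projection_name, is_executing
    from tuple_mover_operations
    where is_executing;

    -- current locks
    select * from locks;

    -- WOS storage belonging to an old epoch (stale-checkpoint suspects)
    select node_name, schema_name, projection_name, start_epoch, end_epoch
    from storage_containers
    where storage_type = 'WOS'
    order by start_epoch;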

      --Sharon

  • Options
    Hello,
      We were reviewing more before we perform our maintenance, and wanted to verify the statement:

    "Epoch advanced when commit occurs and do not  advanced on rollback. Im sure its a "feature", because Vertica can do "historical queries" and Retention Time defines how many history to save (in epochs or in seconds). You configured to "manually", i.e all data is saved, no delete vectors purging by Vertica, but only by manual purging. Thats why AHM doesnt advanced."

    Our understanding is that we disabled HistoryRetentionEpochs by setting it to -1 (the default), and that by setting HistoryRetentionTime to 0, Vertica would purge records as it finds them eligible for purging.  Is this accurate, or have we misunderstood the documentation?

    Thanks again for all your insight!
    Jess
  • Options
    Hello,
     We attempted to run through the process Daniel mentioned, and it completed (we did find some tables/projections that we need to tune), but running select make_ahm_now() returned the same message as before (see below).

    select make_ahm_now('true');
               make_ahm_now            
    -----------------------------------
     AHM already at LGE (AHM Epoch: 0)
    (1 row)


    While investigating, we found that we do have a stale checkpoint, which Sharon mentioned as a potential reason for this behavior, but looking over the table that was created (and similar tables being loaded in that time window), we do not see anything that stands out as an issue.  If we were to remove those tables and reload, would that help in this situation?


    dbadmin=> select * from active_events where event_code_description='Stale Checkpoint';
        node_name    | event_code | event_id | event_severity |    event_posted_timestamp     |       event_expiration        | event_code_description |                                       event_problem_description                                       | reporting_node  | event_sent_to_channels | event_posted_count 
    -----------------+------------+----------+----------------+-------------------------------+-------------------------------+------------------------+-------------------------------------------------------------------------------------------------------+-----------------+------------------------+--------------------
     v_gadw_node0001 |         15 |        0 | Warning        | 2014-05-31 19:39:07.008033+00 | 2014-06-07 04:04:57.01115+00  | Stale Checkpoint       | Node v_gadw_node0001 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0001 | Vertica Log            |               1828
     v_gadw_node0002 |         15 |        0 | Warning        | 2014-05-31 19:39:07.015239+00 | 2014-06-07 04:05:18.017214+00 | Stale Checkpoint       | Node v_gadw_node0002 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0002 | Vertica Log            |               1828
     v_gadw_node0004 |         15 |        0 | Warning        | 2014-05-31 19:39:07.007677+00 | 2014-06-07 04:04:45.00903+00  | Stale Checkpoint       | Node v_gadw_node0004 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0004 | Vertica Log            |               1828
     v_gadw_node0005 |         15 |        0 | Warning        | 2014-05-31 19:39:07.009135+00 | 2014-06-07 04:04:50.01101+00  | Stale Checkpoint       | Node v_gadw_node0005 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0005 | Vertica Log            |               1828
    (4 rows)

    dbadmin=> 


    We took the date/time and searched within our logs looking for the potential culprit, but the COPY statements being used were all using the DIRECT keyword, so our assumption was that they wrote directly to ROS.


    v_gadw_node0001_catalog]$ grep -A1 COPY check_this |grep stage_market|sed -e "s/^ *//g" |sort|uniq -c|sort -nr
       1227    stage_market.amazon_search_terms (
         42    stage_market.amazon_market_basket_analysis (
         42 COPY stage_market.amazon_alt_purchase (
         38    stage_market.amazon_item_view_pre_purchase (
         22    stage_market.amazon_cust_purchase_behavior (
          2 stage_market.dfa_standard_rpt_campaign_agg_stats(

    Per your suggestion, we did check the storage_containers and we were unable to find any containers of type WOS.

    Please let us know if there is anything further we can test.  We'll keep reviewing on our end and advise the thread if we find anything else that stands out.

    Thanks again,
    Jess


  • Options
    Hi Jess!
    K-Safety - Do all tables satisfy K-safety?
    Do you have a node down? Is the database in compliance with respect to K-safety? Where is node 3 - or is it just a numbering gap (node 3 removed, node 5 added)?

    PS
    If you have a node down and a projection that is segmented on that node, you must redistribute your data across all nodes except the down node. The LGE can't advance while a down node still has "data" on it.
    If so => that's why the LGE isn't advancing, and that's why you get the "Stale Checkpoint" event.
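
    One way to check (a minimal sketch; MARK_DESIGN_KSAFE(1) should error out if the physical design does not actually satisfy K=1, and the SYSTEM columns are the ones already shown in this thread):

    select designed_fault_tolerance, current_fault_tolerance, node_down_count from system;
    select mark_design_ksafe(1);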


  • Options
    Hey Daniel,
       No nodes are down and functionally everything seems sane:

    dbadmin=> select node_name,node_state,is_ephemeral from nodes;
        node_name    | node_state | is_ephemeral 
    -----------------+------------+--------------
     v_gadw_node0001 | UP         | f
     v_gadw_node0002 | UP         | f
     v_gadw_node0003 | UP         | f
     v_gadw_node0004 | UP         | f
     v_gadw_node0005 | UP         | f
    (5 rows)

    dbadmin=> 


    K-Safety is set to 1 for the cluster.  Is there something more I can check in that regard?


    Thanks,

    Jess




  • Options
    Jesse - does the PROJECTION_CHECKPOINT_EPOCHS table show any projections with a 0 checkpoint_epoch (or anything abnormal looking)?

    select * from projection_checkpoint_epochs order by checkpoint_epoch;

      --Sharon

  • Options
    Hi Sharon,
      Yes, there are 227 entries that are at 0.

    dbadmin=> select count(*) from projection_checkpoint_epochs where checkpoint_epoch=0;
     count 
    -------
       227
    (1 row)

    dbadmin=> 


    Thanks,

    Jess


  • Options
    Are the zero checkpoint_epochs on all nodes or a subset of nodes?

    Does the DELETE_VECTORS table show these projections as having DVs in the WOS?

    Look for moveout errors in the logs for the affected nodes.  It could be that the moveout is failing because of a "too many ros containers" error, in which case you could temporarily increase the ContainersPerProjectionLimit configuration parameter to let the moveout succeed.  But track down the root cause first.  If the root cause isn't obvious, this seems like a good time to open a Support ticket.
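
    If it comes to that, a minimal sketch of the temporary bump (SET_CONFIG_PARAMETER is the usual meta-function for this, to the best of my knowledge; the 2048 value is only an example, and remember to restore the original setting afterwards):

    -- check the current limit first (table/column names assumed)
    select parameter_name, current_value
    from configuration_parameters
    where parameter_name = 'ContainersPerProjectionLimit';

    -- temporarily raise it so the moveout can complete
    select set_config_parameter('ContainersPerProjectionLimit', 2048);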

      --Sharon

  • Options
    Hey Sharon,
      Looks like all nodes (we have a 5 node cluster).

    dbadmin=> select distinct node_name,count(*) from projection_checkpoint_epochs where checkpoint_epoch=0 group by 1;
        node_name    | count 
    -----------------+-------
     v_gadw_node0002 |    51
     v_gadw_node0004 |    51
     v_gadw_node0001 |    51
     v_gadw_node0003 |    23
     v_gadw_node0005 |    51
    (5 rows)

    dbadmin=> 


    I don't see any of the projections listed in the DELETE_VECTORS table (please let me know if there is a better way to query that).

    dbadmin=> select * from DELETE_VECTORS where projection_name in (select projection_name from projection_checkpoint_epochs where checkpoint_epoch=0);
     node_name | schema_name | projection_name | storage_type | dv_oid | storage_oid | deleted_row_count | used_bytes | start_epoch | end_epoch | is_sorted 
    -----------+-------------+-----------------+--------------+--------+-------------+-------------------+------------+-------------+-----------+-----------
    (0 rows)

    dbadmin=>

    I was able to check all of our logs, and it looks like we did have some of those "too many ros containers" messages, but they look to have cleared.  Any thoughts on whether that would continue to be an issue even though the log says it cleared?


    2014-06-01 16:41:00.523 TM Moveout:0x2b4ff0005100 <LOG> @v_gadw_node0002: 00000/3298: Event Posted: Event Code:14 Event Id:197 Event Severity: Warning [4] PostedTimestamp: 2014-06-01 16:41:00.52317 ExpirationTimestamp: 2014-06-01 16:41:15.52317 EventCodeDescription: Timer Service Task Error ProblemDescription: threadShim: Attempted to create too many ROS containers for projection stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002 DatabaseName:  Hostname: msr-bi-vert04.ubisoft.org

    2014-06-01 16:41:00.523 TM Moveout:0x2b4ff0005100 <ERROR> @v_gadw_node0002: {threadShim} 53000/2245: Attempted to create too many ROS containers for projection stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002

    2014-06-01 16:41:16.003 AgeOutEvents:0x2b4ff000a2c0 <LOG> @v_gadw_node0002: 00000/3293: Event Cleared: Event Code:14 Event Id:197 Event Severity: Warning [4] PostedTimestamp: 2014-06-01 16:41:00.52317 ExpirationTimestamp: 2014-06-01 16:41:15.52317 EventCodeDescription: Timer Service Task Error ProblemDescription: threadShim: Attempted to create too many ROS containers for projection stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002 DatabaseName:  Hostname: msr-bi-vert04.ubisoft.org

    2014-06-01 16:41:31.003 AgeOutEvents:0x2b4ff000a5f0 <LOG> @v_gadw_node0002: 00000/3293: Event Cleared: Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp: 2014-06-01 16:41:00.506652 ExpirationTimestamp: 2014-06-01 16:41:30.506652 EventCodeDescription: Too Many ROS Containers ProblemDescription: Too many ROS containers exist on this node for projection: stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002 DatabaseName:  Hostname: msr-bi-vert04.ubisoft.org


    I don't see the above errors for every member of the cluster, just nodes 2 & 4, and only on 6/01.  
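
    For what it's worth, a rough way to see which projections are currently closest to the container limit (this assumes STORAGE_CONTAINERS has one row per container, which matches how we've been querying it):

    select node_name, schema_name, projection_name, count(*) as ros_containers
    from storage_containers
    where storage_type = 'ROS'
    group by 1, 2, 3
    order by ros_containers desc
    limit 20;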

    We have engaged professional services, but we're not due to meet with them until next week, so we're hoping we can continue to research and give them as much information as possible.


    Thanks,

    Jess


  • Options
    Does "select * from system" still show data in the WOS?

    You've said that you don't see anything in the WOS in storage_containers or delete_vectors.  Could you double check:

    select storage_type, count(*) from delete_vectors where storage_type like '%WOS' group by 1;

    select storage_type, count(*) from storage_containers where storage_type like '%WOS' group by 1;


    I suggest taking one of the projections with a zero checkpoint_epoch and searching the logs for it.  You're interested in moveout - confirm that it's trying to moveout these projections [I am assuming they do actually have data in the WOS] and try to identify the error.  A 0 checkpoint_epoch suggests to me that these projections never ever had a successful moveout from the WOS.  It would also be interesting to trace the history on one of the projections/tables - how was it originally loaded? Any interesting actions performed on these around 2014-05-31 19:39:07.008033+00  when the stale checkpoint event started?
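
    A sketch of the kind of trace I mean, using a hypothetical projection name (substitute one of the zero-checkpoint projections; column names assumed from TUPLE_MOVER_OPERATIONS):

    select node_name, operation_name, operation_status, is_executing
    from tuple_mover_operations
    where projection_name = 'my_projection_b0'  -- hypothetical placeholder
    order by operation_start_timestamp;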

      --Sharon





  • Options
    Hey Sharon,
       We don't appear to have any entries for the WOS in the system table, and no results for the queries you provided.

    dbadmin=> select current_epoch,ahm_epoch,last_good_epoch,node_count,wos_used_bytes,wos_row_count from system;
     current_epoch | ahm_epoch | last_good_epoch | node_count | wos_used_bytes | wos_row_count 
    ---------------+-----------+-----------------+------------+----------------+---------------
            190789 |         0 |               0 |          5 |              0 |             0
    (1 row)

    dbadmin=> select storage_type, count(*) from delete_vectors where storage_type like '%WOS' group by 1;
     storage_type | count 
    --------------+-------
    (0 rows)

    dbadmin=> select storage_type, count(*) from storage_containers where storage_type like '%WOS' group by 1;
     storage_type | count 
    --------------+-------
    (0 rows)

    dbadmin=> 

    I will check the logs again for those projection names and get back to you.

    Thanks again!

    Jess


  • Options
    No data in the WOS, and no delete vectors in the WOS, so any moveouts that were needed completed.  Your main concern is not losing your database because LGE is still zero.  I'd take a look at those tables whose projections have a zero checkpoint epoch, see how many rows they have, and see about recreating these tables to see if this state clears up.  You could move the data into new tables and drop the current tables (and verify that the checkpoint epoch of the new tables is ok).  Or if the affected tables are small just drop them.  
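
    A minimal sketch of that shuffle, using a hypothetical table name (verify that the new table's projections and segmentation come out the way you need before dropping anything):

    create table stage_market.my_table_new as select * from stage_market.my_table;
    drop table stage_market.my_table cascade;
    alter table stage_market.my_table_new rename to my_table;

    -- confirm the new projections have a sane checkpoint epoch
    select node_name, projection_name, checkpoint_epoch
    from projection_checkpoint_epochs
    where projection_name ilike 'my_table%'
    order by checkpoint_epoch;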

      --Sharon

  • Options
    Hi Jesse!

    1. Can you post the query output, and the output of vertica.log, while you run:
    select refresh();
    select purge();
    select make_ahm_now(); -- without true
    select * from system;
    ?

    2. Are your machines physical or virtual?

    3. What is the file system for the data and catalog locations? Is it journaled (like ext4) or not (like ext2)?

    PS
    www.pastebin.com
  • Options
    Hello,
      Just wanted to update the thread.  Thanks, Daniel, for replying, but Sharon's suggestion to remove and recreate the tables did the trick.  We reviewed the tables and removed them, and the system immediately created a new LGE and updated the AHM.

    We did want to ask, though: several were global temp tables, and several were regular tables.  Would having global temp tables cause this issue?  As Sharon mentioned, we want to make sure we remove the culprit, so we want to review our logs in more detail and try to isolate it.

    I'd also like to continue digging to find the root cause so I can post it here.  Let me know if you have any thoughts on how to "test" this (in dev, as I do not want to interrupt our sleep again :-)

    Here's our latest query against system:
    dbadmin=> select * from system;
    -[ RECORD 1 ]------------+--------------
    current_epoch            | 190824
    ahm_epoch                | 190823
    last_good_epoch          | 190823
    refresh_epoch            | -1
    designed_fault_tolerance | 1
    node_count               | 5
    node_down_count          | 0
    current_fault_tolerance  | 1
    catalog_revision_number  | 199888
    wos_used_bytes           | 0
    wos_row_count            | 0
    ros_used_bytes           | 1818455205032
    ros_row_count            | 49001468391
    total_used_bytes         | 1818455205032
    total_row_count          | 49001468391

    dbadmin=> 

    Thanks again guys!  Huge help, and we can now sleep at night :-) 
    - Jess


  • Options
    [Apologies for not posting this sooner; it would have been useful before you dropped the relevant tables. -Ben]

    The following query should indicate which projection is holding back the LGE:

    select p.name, checkpointepoch
    from vs_storage_columns sc, vs_projection_columns pc, vs_projections p, vs_projection_column_order co
    where sc.projcolumn = pc.oid
      and pc.proj = p.oid
      and co.column_id = pc.oid
      and co.sort_position = 0
      and checkpointepoch != '-1'
      and not p.istemp;

    If you don't need the data in this table, the simplest solution is to drop it.

    Did you ever flirt with K=0 with this cluster (set it down to 0, then move it back to 1)?

    Temp tables should not impact the LGE.
  • Options
    Thanks Ben!  I am going to feverishly write up everything we learned with this one, and will include your note.  Good to know about temp tables vs LGE.

    In terms of K-safety, we've always been at K=1.

  • Options
    >> remove and recreate the tables did the trick
    What matters now is not just to work around the problem (support likes to ignore the underlying problem and mark as solved what isn't actually solved) but to locate it. If you ran MAKE_AHM_NOW and didn't get any error, yet the AHM and LGE were still 0, that is a problem, and if it still exists it can occur again in the future. Say you have 5TB of data: how long would it take to recreate? And what about 1PB? No, this isn't a workaround, it's a trick.  Can you do what I asked?

    Now we'll see how support reacts.

    PS
    Agree with Ben, I also suspect K-Safety.
  • Options
    Hey Daniel,
      I'm not following your statements or logic from above, but as requested, here is the output of make_ahm_now().  Note that since performing the steps Sharon suggested, we are not having any issues with the AHM or getting the previous "AHM already at LGE (AHM Epoch: 0)" message from that function call (still monitoring, of course, but everything seems good).

    dbadmin=> select * from system;
     current_epoch | ahm_epoch | last_good_epoch | refresh_epoch | designed_fault_tolerance | node_count | node_down_count | current_fault_tolerance | catalog_revision_number | wos_used_bytes | wos_row_count | ros_used_bytes | ros_row_count | total_used_bytes | total_row_count 
    ---------------+-----------+-----------------+---------------+--------------------------+------------+-----------------+-------------------------+-------------------------+----------------+---------------+----------------+---------------+------------------+-----------------
            200703 |    200702 |          200702 |            -1 |                        1 |          5 |               0 |                       1 |                  210206 |              0 |             0 |  1842356965097 |   49267613264 |    1842356965097 |     49267613264
    (1 row)

    dbadmin=> select make_ahm_now();
              make_ahm_now           
    ---------------------------------
     AHM set (New AHM Epoch: 200703)
    (1 row)

    dbadmin=> select * from system;
     current_epoch | ahm_epoch | last_good_epoch | refresh_epoch | designed_fault_tolerance | node_count | node_down_count | current_fault_tolerance | catalog_revision_number | wos_used_bytes | wos_row_count | ros_used_bytes | ros_row_count | total_used_bytes | total_row_count 
    ---------------+-----------+-----------------+---------------+--------------------------+------------+-----------------+-------------------------+-------------------------+----------------+---------------+----------------+---------------+------------------+-----------------
            200704 |    200703 |          200703 |            -1 |                        1 |          5 |               0 |                       1 |                  210208 |              0 |             0 |  1842356965097 |   49267613264 |    1842356965097 |     49267613264
    (1 row)

    dbadmin=> 

    To answer your other questions,

    1.> See above, but I don't think it's necessary to go through all the steps, as the issue appears to be resolved (we now have a proper AHM and the LGE is advancing as we would expect).

    2.> 5 node hardware cluster, 2TB volume per node (not LVM)

    3.> Ext4


    Thanks again for your input.

    Jess

  • Options
    Jess - what version are you running?

      --Sharon
