Vertica delete issues
Hello, we are noticing an issue in our cluster: deleted records are being retained. While reviewing our logs, we can see the Tuple Mover attempt a mergeout, but then roll back the transaction shortly after.
Our settings are as follows (confirmed with the query shown below the list):
HistoryRetentionEpochs is set to -1
HistoryRetentionTime is set to 0
MergeOutInterval is set to 600
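For reference, we confirm these values with a query along these lines against CONFIGURATION_PARAMETERS (exact column names may vary slightly by version):
select parameter_name, current_value, default_value
from configuration_parameters
where parameter_name in ('HistoryRetentionEpochs', 'HistoryRetentionTime', 'MergeOutInterval');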
---
The roll back log entries we are seeing are:
2014-06-04 16:53:00.002 TM Mergeout(00):0x2b50500458a0-b00000000238eb [Txn] <INFO> Begin Txn: b00000000238eb 'Mergeout: Tuple Mover'
2014-06-04 16:53:00.310 TM Mergeout(00):0x2b50500458a0-b00000000238eb [TM] <INFO> Tuple Mover: nothing to merge out
2014-06-04 16:53:00.332 TM Mergeout(00):0x2b50500458a0-b00000000238eb [TM] <INFO> Tuple Mover: no DV to merge out
2014-06-04 16:53:00.332 TM Mergeout(00):0x2b50500458a0-b00000000238eb [Txn] <INFO> Rollback Txn: b00000000238eb 'Mergeout: Tuple Mover'
2014-06-04 16:53:00.346 TM Mergeout(00):0x2b50500458a0-b00000000238ec [Txn] <INFO> Begin Txn: b00000000238ec 'collectMoveStorageJobs'
2014-06-04 16:53:00.367 TM Mergeout(00):0x2b50500458a0-b00000000238ec [Txn] <INFO> Rollback Txn: b00000000238ec 'collectMoveStorageJobs'
2014-06-04 16:53:00.368 TM Mergeout(00):0x2b50500458a0 [Util] <INFO> Task 'TM Mergeout(00)' enabled
This looks as though there are no delete vectors to be merged out; however, if we query the delete vectors we see quite a few. Our disk usage is also climbing quickly, so we are concerned that we have not tuned this correctly.
Another observation is that our last good epoch (LGE) is not moving forward, so our current epoch and the AHM are out of sync.
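For reference, the checks we are running look roughly like this (the DELETE_VECTORS columns match the output we post further down in the thread, and the SYSTEM columns match the output posted below):
select node_name, schema_name, projection_name, storage_type, deleted_row_count
from delete_vectors
order by deleted_row_count desc
limit 10;
select current_epoch, ahm_epoch, last_good_epoch from system;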
Thanks for any help or insight anyone can provide.
Jess
Comments
select * from system;
current_epoch | ahm_epoch | last_good_epoch | refresh_epoch | designed_fault_tolerance | node_count | node_down_count | current_fault_tolerance | catalog_revision_number | wos_used_bytes | wos_row_count | ros_used_bytes | ros_row_count | total_used_bytes | total_row_count
---------------+-----------+-----------------+---------------+--------------------------+------------+-----------------+-------------------------+-------------------------+----------------+---------------+----------------+---------------+------------------+-----------------
156460 | 0 | 0 | -1 | 1 | 5 | 0 | 1 | 163559 | 2878865408 | 8348879 | 1513703369554 | 43359567485 | 1516582234962 | 43367916364
(1 row)
We think we may be facing a similar issue to https://community.vertica.com/vertica/topics/get_last_good_epoch_not_advancing_so_cant_drop_old_proj..., but there do not seem to be any notes on how to force the moveout to occur. As mentioned, we see attempts, but we also see rollbacks.
The specific text from the linked thread is:
"On move out of the old data from WOS, LGE advanced." How was this accomplished? We are concerned because we are in the process of reloading our warehouse and are not seeing the LGE advance, so if we were to run into trouble, our understanding is we'd have to start over from the beginning.
Thanks for any advice.
On a fresh db: should I worry about this on my laptop (single-node cluster)?
While I have seen rollbacks over time, this one is puzzling because it is the mergeout operation, and we do not see the AHM moving forward or the LGE being set when these transactions complete (or roll back, in this case). In the case you show, it seems that your LGE and AHM are moving forward. Do you have thoughts on the scenarios in which the LGE and AHM would not progress? Is there something more I can check? While rollbacks are evident, we are wondering if there is a correlation, as we are also seeing these messages:
2014-06-05 04:37:37.626 DistCall Dispatch:0x2b2d6c635350-a0000000011832 [Txn] <INFO> Rollback Txn: a0000000011832 'Manage Epochs: Advance AHM'
Are we chasing a red herring with the Tuple Mover? Is there anything else I can check in the logs to figure out why the LGE / AHM are not progressing?
Thanks,
Jess
Epochs advance when a commit occurs and do not advance on rollback. I'm sure it's a "feature": Vertica can run "historical queries", and the retention settings define how much history to keep (in epochs or in seconds). You have configured it for "manual" mode, i.e. all history is kept and delete vectors are not purged by Vertica automatically, only by manual purging. That's why the AHM doesn't advance.
>> Anything else I can check in the logs to figure out why LGE / AHM are not progressing?
Tables (example queries below):
- dc_tuple_mover_events
- vs_tuple_mover_operations
- tuple_mover_operations
- resource_rejection_details
Function: Stale Check Points - http://vertica-forums.com/viewtopic.php?f=7&t=1710#p5648 (Sharon?)
K-Safety - do all tables satisfy K-safety?
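For example, a starting point against the tables above (plain selects; the LIMITs are just to keep the output manageable):
select * from dc_tuple_mover_events limit 50;
select * from tuple_mover_operations limit 50;
select * from resource_rejection_details limit 50;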
This is perfect! I think this is going to resolve our issue. I'm going to do some maintenance work tonight, run do_tm_task('mergeout') during that window, and then run a full backup. I'll post back if this resolves the issue.
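For the maintenance window, the plan is roughly the following (the table-qualified form is optional; the table name here is just one of ours as an example):
select do_tm_task('moveout');
select do_tm_task('mergeout');
-- or per table, e.g.:
select do_tm_task('mergeout', 'stage_market.amazon_search_terms');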
Jess
1. Unrefreshed projections: select * from projections where not is_up_to_date;
2. Something is interfering with a moveout - it can't get a lock, for example. This might be because of a long-running replay-delete tuple mover phase - see tuple_mover_operations where is_executing and/or the locks table.
3. If you have stale checkpoints (active_events table) and don't know which projection is causing them, check the storage_containers table for storage_type 'WOS' and see what belongs to an old epoch (example queries below).
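Concretely, something along these lines (adjust the WOS pattern to the storage_type values your version reports):
select * from tuple_mover_operations where is_executing;
select * from locks;
select node_name, schema_name, projection_name, storage_type
from storage_containers
where storage_type like '%WOS';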
--Sharon
We were reviewing further before performing our maintenance and wanted to verify this statement:
"Epochs advance when a commit occurs and do not advance on rollback. I'm sure it's a "feature": Vertica can run "historical queries", and the retention settings define how much history to keep (in epochs or in seconds). You have configured it for "manual" mode, i.e. all history is kept and delete vectors are not purged by Vertica automatically, only by manual purging. That's why the AHM doesn't advance."
Our understanding is that we disabled HistoryRetentionEpochs by setting it to -1 (the default), and that by setting HistoryRetentionTime to 0, Vertica should purge records as it finds them eligible. Is this accurate, or have we misunderstood the documentation?
Thanks again for all your insight!
Jess
We ran through the process Daniel mentioned and it completed (we did find some tables/projections that we need to tune), but running make_ahm_now() returned the same message as before (see below).
select make_ahm_now('true');
make_ahm_now
-----------------------------------
AHM already at LGE (AHM Epoch: 0)
(1 row)
While investigating, we found that we do have a stale checkpoint, which Sharon mentioned as a potential reason for this behavior; but looking over the table that was created (and similar tables being loaded in that time window), we do not see anything that stands out as an issue. If we were to remove those tables and reload, would that help in this situation?
dbadmin=> select * from active_events where event_code_description='Stale Checkpoint';
node_name | event_code | event_id | event_severity | event_posted_timestamp | event_expiration | event_code_description | event_problem_description | reporting_node | event_sent_to_channels | event_posted_count
-----------------+------------+----------+----------------+-------------------------------+-------------------------------+------------------------+-------------------------------------------------------------------------------------------------------+-----------------+------------------------+--------------------
v_gadw_node0001 | 15 | 0 | Warning | 2014-05-31 19:39:07.008033+00 | 2014-06-07 04:04:57.01115+00 | Stale Checkpoint | Node v_gadw_node0001 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0001 | Vertica Log | 1828
v_gadw_node0002 | 15 | 0 | Warning | 2014-05-31 19:39:07.015239+00 | 2014-06-07 04:05:18.017214+00 | Stale Checkpoint | Node v_gadw_node0002 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0002 | Vertica Log | 1828
v_gadw_node0004 | 15 | 0 | Warning | 2014-05-31 19:39:07.007677+00 | 2014-06-07 04:04:45.00903+00 | Stale Checkpoint | Node v_gadw_node0004 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0004 | Vertica Log | 1828
v_gadw_node0005 | 15 | 0 | Warning | 2014-05-31 19:39:07.009135+00 | 2014-06-07 04:04:50.01101+00 | Stale Checkpoint | Node v_gadw_node0005 has data 3750 seconds old which has not yet made it to disk. (LGE: 0 Last: 1370) | v_gadw_node0005 | Vertica Log | 1828
(4 rows)
dbadmin=>
We took the date/time and searched within our logs, looking for the potential culprit, but the COPY statements being used all included the DIRECT keyword, so our assumption was that they wrote directly to ROS.
v_gadw_node0001_catalog]$ grep -A1 COPY check_this |grep stage_market|sed -e "s/^ *//g" |sort|uniq -c|sort -nr
1227 stage_market.amazon_search_terms (
42 stage_market.amazon_market_basket_analysis (
42 COPY stage_market.amazon_alt_purchase (
38 stage_market.amazon_item_view_pre_purchase (
22 stage_market.amazon_cust_purchase_behavior (
2 stage_market.dfa_standard_rpt_campaign_agg_stats(
Per your suggestion, we did check the storage_containers and we were unable to find any containers of type WOS.
Please let us know if there is anything further we can test. We'll keep reviewing on our end and update the thread if we find anything else that stands out.
Thanks again,
Jess
PS
If you have a node down and a projection segmented on that node, you must redistribute your data across all nodes except the down node. The LGE can't advance while a down node still has "data" on it.
If so => that's why the LGE isn't advancing, and that's why you get the "Stale Checkpoint" event.
No nodes are down and functionally everything seems sane:
dbadmin=> select node_name,node_state,is_ephemeral from nodes;
node_name | node_state | is_ephemeral
-----------------+------------+--------------
v_gadw_node0001 | UP | f
v_gadw_node0002 | UP | f
v_gadw_node0003 | UP | f
v_gadw_node0004 | UP | f
v_gadw_node0005 | UP | f
(5 rows)
dbadmin=>
K-Safety is set to 1 for the cluster. Is there something more I can check in that regard?
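For the record, this is what we are looking at for K-safety (same SYSTEM columns as in the output posted earlier):
select designed_fault_tolerance, current_fault_tolerance, node_down_count from system;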
Thanks,
Jess
select * from projection_checkpoint_epochs order by checkpoint_epoch;
--Sharon
Yes, there are 227 entries that are at 0.
dbadmin=> select count(*) from projection_checkpoint_epochs where checkpoint_epoch=0;
count
-------
227
(1 row)
dbadmin=>
Thanks,
Jess
Does the DELETE_VECTORS table show these projections as having DVs in the WOS?
Look for moveout errors in the logs for the affected nodes. It could be that the moveout is failing because of a "too many ros containers" error, in which case you could temporarily increase the ContainersPerProjectionLimit configuration parameter to let the moveout succeed. But track down the root cause first. If the root cause isn't obvious, this seems like a good time to open a Support ticket.
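If it does come to raising the limit temporarily, the check and the change would look something like this (2048 is only an illustrative value - check your version's default first, and set it back once the moveout has caught up):
select parameter_name, current_value, default_value
from configuration_parameters
where parameter_name = 'ContainersPerProjectionLimit';
select set_config_parameter('ContainersPerProjectionLimit', 2048);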
--Sharon
Looks like all nodes (we have a 5 node cluster).
dbadmin=> select distinct node_name,count(*) from projection_checkpoint_epochs where checkpoint_epoch=0 group by 1;
node_name | count
-----------------+-------
v_gadw_node0002 | 51
v_gadw_node0004 | 51
v_gadw_node0001 | 51
v_gadw_node0003 | 23
v_gadw_node0005 | 51
(5 rows)
dbadmin=>
I don't see any of the projections listed in the DELETE_VECTORS table (please let me know if there is a better way to query that).
dbadmin=> select * from DELETE_VECTORS where projection_name in (select projection_name from projection_checkpoint_epochs where checkpoint_epoch=0);
node_name | schema_name | projection_name | storage_type | dv_oid | storage_oid | deleted_row_count | used_bytes | start_epoch | end_epoch | is_sorted
-----------+-------------+-----------------+--------------+--------+-------------+-------------------+------------+-------------+-----------+-----------
(0 rows)
dbadmin=>
I was able to check all of our logs, and it looks like we did have some of those "too many ros containers" messages, but they appear to have cleared. Any thoughts on whether that would continue to be an issue even though the log says it cleared?
2014-06-01 16:41:00.523 TM Moveout:0x2b4ff0005100 <LOG> @v_gadw_node0002: 00000/3298: Event Posted: Event Code:14 Event Id:197 Event Severity: Warning [4] PostedTimestamp: 2014-06-01 16:41:00.52317 ExpirationTimestamp: 2014-06-01 16:41:15.52317 EventCodeDescription: Timer Service Task Error ProblemDescription: threadShim: Attempted to create too many ROS containers for projection stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002 DatabaseName: Hostname: msr-bi-vert04.ubisoft.org
2014-06-01 16:41:00.523 TM Moveout:0x2b4ff0005100 <ERROR> @v_gadw_node0002: {threadShim} 53000/2245: Attempted to create too many ROS containers for projection stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002
2014-06-01 16:41:16.003 AgeOutEvents:0x2b4ff000a2c0 <LOG> @v_gadw_node0002: 00000/3293: Event Cleared: Event Code:14 Event Id:197 Event Severity: Warning [4] PostedTimestamp: 2014-06-01 16:41:00.52317 ExpirationTimestamp: 2014-06-01 16:41:15.52317 EventCodeDescription: Timer Service Task Error ProblemDescription: threadShim: Attempted to create too many ROS containers for projection stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002 DatabaseName: Hostname: msr-bi-vert04.ubisoft.org
2014-06-01 16:41:31.003 AgeOutEvents:0x2b4ff000a5f0 <LOG> @v_gadw_node0002: 00000/3293: Event Cleared: Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp: 2014-06-01 16:41:00.506652 ExpirationTimestamp: 2014-06-01 16:41:30.506652 EventCodeDescription: Too Many ROS Containers ProblemDescription: Too many ROS containers exist on this node for projection: stage_uat.apple_financial_aggs_DBD_42_rep_20130213_stage_20130213_stage_node0002 DatabaseName: Hostname: msr-bi-vert04.ubisoft.org
I don't see the above errors for every member of the cluster, just nodes 2 & 4, and only on 6/01.
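To see whether any projection is still near the container limit, we are counting containers per projection roughly like this (storage_containers columns as used elsewhere in this thread):
select node_name, projection_name, count(*) as ros_container_count
from storage_containers
group by 1, 2
order by 3 desc
limit 10;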
We have engaged professional services, but we're not due to meet with them until next week, so we're hoping we can continue to research in the meantime and give them as much information as possible.
Thanks,
Jess
You've said that you don't see anything in the WOS in storage_containers or delete_vectors. Could you double check:
select storage_type, count(*) from delete_vectors where storage_type like '%WOS' group by 1;
select storage_type, count(*) from storage_containers where storage_type like '%WOS' group by 1;
I suggest taking one of the projections with a zero checkpoint_epoch and searching the logs for it. You're interested in moveout - confirm that it's trying to move out these projections [I am assuming they do actually have data in the WOS] and try to identify the error. A 0 checkpoint_epoch suggests to me that these projections never had a successful moveout from the WOS. It would also be interesting to trace the history of one of the projections/tables - how was it originally loaded? Were any interesting actions performed on these around 2014-05-31 19:39:07.008033+00, when the stale checkpoint event started?
--Sharon
We don't appear to have any WOS entries in the SYSTEM table, and there are no results for the queries you provided.
dbadmin=> select current_epoch,ahm_epoch,last_good_epoch,node_count,wos_used_bytes,wos_row_count from system;
current_epoch | ahm_epoch | last_good_epoch | node_count | wos_used_bytes | wos_row_count
---------------+-----------+-----------------+------------+----------------+---------------
190789 | 0 | 0 | 5 | 0 | 0
(1 row)
dbadmin=> select storage_type, count(*) from delete_vectors where storage_type like '%WOS' group by 1;
storage_type | count
--------------+-------
(0 rows)
dbadmin=>
dbadmin=> select storage_type, count(*) from storage_containers where storage_type like '%WOS' group by 1;
storage_type | count
--------------+-------
(0 rows)
dbadmin=>
I will check the logs again for those projection names and get back to you.
Thanks again!
Jess
1. Can you post the output of MAKE_AHM_NOW, and the corresponding vertica.log entries from when you run it?
2. Are you on physical machines or virtual?
3. What is the file system for the data and catalog? Is it journaled (like ext4) or non-journaled (like ext2)?
PS
www.pastebin.com
Just wanted to update the thread. Thanks, Daniel, for replying, but Sharon's suggestion to remove and recreate the tables did the trick. We reviewed the tables and removed them, and the system immediately set a new LGE and updated the AHM.
We did want to ask one question, though: several of the tables were global temp tables, and several were regular tables. Would having global temp tables cause this issue? As Sharon mentioned, we want to make sure we remove the real culprit, so we plan to review our logs in more detail and try to isolate it.
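To separate the two groups, we listed the temp tables with something like this (our version exposes an is_temp_table flag in the TABLES view; the column name may differ in other releases):
select table_schema, table_name
from tables
where is_temp_table;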
I'd also like to keep digging for the root cause so I can post it here. Let me know if you have any thoughts on how to "test" this (in dev, as I do not want to interrupt our sleep again :-)).
Here's our latest query against system:
dbadmin=> select * from system;
-[ RECORD 1 ]------------+--------------
current_epoch | 190824
ahm_epoch | 190823
last_good_epoch | 190823
refresh_epoch | -1
designed_fault_tolerance | 1
node_count | 5
node_down_count | 0
current_fault_tolerance | 1
catalog_revision_number | 199888
wos_used_bytes | 0
wos_row_count | 0
ros_used_bytes | 1818455205032
ros_row_count | 49001468391
total_used_bytes | 1818455205032
total_row_count | 49001468391
dbadmin=>
Thanks again guys! Huge help, and we can now sleep at night :-)
- Jess
The following query should indicate which projection is holding back the LGE:
select p.name, checkpointepoch
from vs_storage_columns sc,
     vs_projection_columns pc,
     vs_projections p,
     vs_projection_column_order co
where sc.projcolumn = pc.oid
  and pc.proj = p.oid
  and co.column_id = pc.oid
  and co.sort_position = 0
  and checkpointepoch != '-1'
  and not p.istemp;
If you don't need the data in this table, the simplest solution is to drop it.
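If the table really is expendable, dropping it (and its projections) is just, for example:
drop table stage_uat.apple_financial_aggs cascade;
(The name here is only a guess based on the projection in the log excerpt above - substitute the actual table.)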
Did you ever flirt with K=0 with this cluster (set it down to 0, then move it back to 1)?
Temp tables should not impact the LGE.
In terms of K safety, we've always been at K1.
What matters now is not just to work around the problem but to locate it (the problem still exists; support likes to ignore a problem and mark it solved when it isn't).
If you ran MAKE_AHM_NOW and didn't get any error, but the AHM and LGE were still 0, that is a problem, and if it exists it can occur again in the future. Say you have 5 TB of data - how long would that take to recreate? And what about 1 PB? No, this isn't a workaround, it's a trick. Can you do what I asked?
Now we'll see how support reacts.
PS
Agree with Ben, I also suspect K-Safety.
I'm not following your statements or logic above, but as requested, here is the output of make_ahm_now. Note that since performing the steps Sharon suggested, we are not having any issues with the AHM, nor are we getting the previous "AHM already at LGE (AHM Epoch: 0)" message from that function call (still monitoring of course, but everything seems good).
dbadmin=> select * from system;
current_epoch | ahm_epoch | last_good_epoch | refresh_epoch | designed_fault_tolerance | node_count | node_down_count | current_fault_tolerance | catalog_revision_number | wos_used_bytes | wos_row_count | ros_used_bytes | ros_row_count | total_used_bytes | total_row_count
---------------+-----------+-----------------+---------------+--------------------------+------------+-----------------+-------------------------+-------------------------+----------------+---------------+----------------+---------------+------------------+-----------------
200703 | 200702 | 200702 | -1 | 1 | 5 | 0 | 1 | 210206 | 0 | 0 | 1842356965097 | 49267613264 | 1842356965097 | 49267613264
(1 row)
dbadmin=> select make_ahm_now();
make_ahm_now
---------------------------------
AHM set (New AHM Epoch: 200703)
(1 row)
dbadmin=> select * from system;
current_epoch | ahm_epoch | last_good_epoch | refresh_epoch | designed_fault_tolerance | node_count | node_down_count | current_fault_tolerance | catalog_revision_number | wos_used_bytes | wos_row_count | ros_used_bytes | ros_row_count | total_used_bytes | total_row_count
---------------+-----------+-----------------+---------------+--------------------------+------------+-----------------+-------------------------+-------------------------+----------------+---------------+----------------+---------------+------------------+-----------------
200704 | 200703 | 200703 | -1 | 1 | 5 | 0 | 1 | 210208 | 0 | 0 | 1842356965097 | 49267613264 | 1842356965097 | 49267613264
(1 row)
dbadmin=>
To answer your other questions,
1. See above, but I don't think it's necessary to go through all the steps, as the issue appears to be resolved (we now have a proper AHM, and the LGE is advancing as we would expect).
2. A 5-node hardware cluster, with a 2 TB volume per node (not LVM).
3. ext4.
Thanks again for your input.
Jess