Last Good Epoch/Ancient History Mark not advancing

TimNooren · November 2016

We run a three-node cluster with Vertica 7.2.0-1, and since November 7th our Last Good Epoch / Ancient History Mark has not advanced.

Current AHM Time: 2016-11-07 15:36:48.442813+00

I followed the steps in this guide: https://my.vertica.com/hpe-vertica-troubleshooting-checklists/ancient-history-mark-not-advancing/ , but this did not have any effect.

The active events table (SELECT * FROM active_events;) shows a stale checkpoint on one of the nodes from around the same time the AHM stopped progressing:

2016-11-07 17:40:50	2016-11-29 10:40:14	Stale Checkpoint	Node v_node0002 has data 3841 seconds old which has not yet made it to disk. (LGE: 26549170 Last: 26549664)

In the logs I can find one error from just before the AHM stopped on November 7th:

2016-11-07 15:01:20.923 TM Moveout:0x7f6a5cd6cd90 <ERROR> @v_byk_dwh_node0002: {threadShim} 55V03/5157: Unavailable: [Txn 0xb000002e05f0a3] Moveout of mart_sales_assistant.snowplow_events_b0 - timeout error Timed out T locking Table:mart_sales_assistant.snowplow_events. X held by [user etl_user

From other posts (https://community.dev.hpe.com/t5/Vertica-Forum/Stale-checkpoint-and-too-many-ROS/td-p/221273, https://community.dev.hpe.com/t5/Vertica-Forum/get-last-good-epoch-not-advancing-so-can-t-drop-old-projections/td-p/210267) I've gathered that the problem might be data stuck in the WOS, which prevents the LGE from advancing, or too many ROS containers. We've fixed a lot of issues with ROS containers, mainly by repartitioning tables. Running a moveout manually still does not seem to solve anything.

Further details:

MoveOutInterval: 200

MergeOutInterval: 200

HistoryRetentionTime: 0

wosdata resource pool: memorysize: 2G, maxmemorysize: 2G

All help is appreciated.

Sharon_Cutter · November 2016

You have either data or delete vectors stuck in the WOS.

- Check the STORAGE_CONTAINERS table where storage_type like '%WOS' ordered by start_epoch to see if there's any data at an epoch right around your current AHM epoch.

- Check the DELETE_VECTORS table where storage_type like '%WOS' ordered by start_epoch to see if there are any DVs at an epoch right around your current AHM epoch.

That will help you identify the problematic table. Then look for locks on that table using the LOCKS table, and check TUPLE_MOVER_OPERATIONS where is_executing.

It could be a long-running tuple mover operation that is preventing the data from getting moved out, or a "too many ROS containers" condition.
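The checks above can be sketched as the following queries. This is only a rough sketch, not an official procedure: the column lists are assumptions based on the standard Vertica system tables, the LIMIT values are arbitrary, and the table name in the LOCKS filter is simply the one from the moveout error in the original post.

```sql
-- Epochs to compare against: AHM, Last Good Epoch, and current epoch.
SELECT get_ahm_epoch(), get_last_good_epoch(), get_current_epoch();

-- Oldest data still sitting in the WOS.
SELECT node_name, schema_name, projection_name, storage_type,
       start_epoch, end_epoch
FROM storage_containers
WHERE storage_type LIKE '%WOS'
ORDER BY start_epoch
LIMIT 20;

-- Oldest delete vectors still sitting in the WOS.
SELECT node_name, schema_name, projection_name, storage_type,
       start_epoch, end_epoch
FROM delete_vectors
WHERE storage_type LIKE '%WOS'
ORDER BY start_epoch
LIMIT 20;

-- Locks on the suspect table (name taken from the log line; adjust as needed).
SELECT node_names, object_name, lock_mode, lock_scope,
       transaction_description, request_timestamp, grant_timestamp
FROM locks
WHERE object_name ILIKE '%snowplow_events%';

-- Tuple mover operations that are running right now.
SELECT operation_start_timestamp, node_name, operation_name,
       operation_status, table_schema, table_name, projection_name
FROM tuple_mover_operations
WHERE is_executing;
```

If the oldest start_epoch in the first two result sets is close to the current epoch rather than to the AHM epoch, stuck WOS data is probably not the cause.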

--Sharon

TimNooren · November 2016

Hi Sharon,

Thank you for your reply.

I followed your advice, with the following results:

The LGE is 26549170

Current epoch is 27505478

select * from storage_containers where storage_type = 'WOS' order by start_epoch; shows me the earliest start_epoch in WOS is 27504665.

select * from DELETE_VECTORS where storage_type like '%WOS%' order by start_epoch; shows an earliest start_epoch of 27504667

Both results are close to the current epoch, which leads me to believe there isn't actually any data stuck in WOS?

select * from TUPLE_MOVER_OPERATIONS where is_executing; does not show any long running Moveout operations.

We did have issues with too many ROS containers around the time the LGE stopped advancing, but not since.

Sharon_Cutter · December 2016

You could look at the PROJECTION_CHECKPOINT_EPOCHS table. Look for a projection on that node whose checkpoint_epoch value is the same as your current LGE. When I just looked at that table it was slow to query, so you probably want to dump it to a regular table or a temp table before you start poking around.
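A minimal sketch of that approach, assuming a session-local temporary table is acceptable. The table name pce_snapshot is made up for illustration, the column list is assumed from the standard system table definition, and the epoch value is the LGE reported earlier in this thread.

```sql
-- Copy the slow system table once into a session-local temp table.
CREATE LOCAL TEMPORARY TABLE pce_snapshot ON COMMIT PRESERVE ROWS AS
SELECT node_name, schema_name, projection_name,
       is_up_to_date, would_recover, is_behind_ahm, checkpoint_epoch
FROM projection_checkpoint_epochs;

-- Projections whose checkpoint epoch equals the stuck LGE.
SELECT node_name, schema_name, projection_name, checkpoint_epoch
FROM pce_snapshot
WHERE checkpoint_epoch = 26549170
ORDER BY node_name, schema_name, projection_name;

-- Projections with the lowest checkpoint epochs overall,
-- i.e. the likely next candidates for holding the LGE back.
SELECT node_name, schema_name, projection_name, checkpoint_epoch
FROM pce_snapshot
ORDER BY checkpoint_epoch
LIMIT 20;
```

The snapshot is a point-in-time copy, so it needs to be dropped and recreated after any change that moves the LGE.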

If that doesn't help, open a support case.

--Sharon

TimNooren · December 2016

Hi Sharon,

This actually solved the problem. Thank you very much.

I followed the steps as you advised:

Found Last Good Epoch using: SELECT get_last_good_epoch(); . LGE was 26549170.

Looked for projections that had a checkpoint_epoch corresponding to LGE:

SELECT * FROM PROJECTION_CHECKPOINT_EPOCHS WHERE checkpoint_epoch = 26549170;

This resulted in one table, which we dropped (cascade) and recreated. Straight away the LGE advanced to 27135640, which is still not close to the current epoch of 27593331, but we can repeat the procedure looking at the next table that stops the LGE from advancing.
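To see how far behind the LGE still is in wall-clock terms, the epoch numbers can be mapped to timestamps. This is a sketch under the assumption that the EPOCHS system table still retains rows for the epochs in question; rows for epochs that are no longer retained, or for the still-open current epoch, will simply be absent.

```sql
-- Current positions of the AHM, the LGE, and the current epoch.
SELECT get_ahm_epoch(), get_ahm_time(),
       get_last_good_epoch(), get_current_epoch();

-- Close times of the old and new LGE values mentioned above.
SELECT epoch_number, epoch_close_time
FROM epochs
WHERE epoch_number IN (26549170, 27135640)
ORDER BY epoch_number;
```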

Thanks again.

Sharon_Cutter · December 2016

Yay!

Though if you are seeing the problem on multiple tables, it would be worth understanding how/why you're getting into this situation.

You are on a relatively old version of 7.2. There's this fix in a 7.2.3 hotfix, if it sounds plausible:

VER-23217

Recovery

The checkpoint epoch of a projection did not advance when the projection became safe as a result of a change in k-safety. This issue has been resolved.

--Sharon

Sharon_Cutter · December 2016

Looking back at your original post and the original error:

2016-11-07 15:01:20.923 TM Moveout:0x7f6a5cd6cd90 <ERROR> @v_byk_dwh_node0002: {threadShim} 55V03/5157: Unavailable: [Txn 0xb000002e05f0a3] Moveout of mart_sales_assistant.snowplow_events_b0 - timeout error Timed out T locking Table:mart_sales_assistant.snowplow_events. X held by [user etl_user

That lock was no longer present?

Looking through the release notes again, there are also bugs related to LGE for table creation during recovery and dropping storage locations. Definitely worth an upgrade to the most recent 7.2 anyway.

TimNooren · December 2016

Hi Sharon,

Yes, the lock is no longer present. The issue seemed to only affect two tables, one of which was definitely created during recovery, so I'm assuming that's where the problem originated. The hotfix seems to describe the issue exactly.

We're planning to upgrade to the latest 7.2.x release soon, so hopefully that will prevent this issue from occurring in the future.

Again, thanks a lot for your help.

Tim

orahow · June 2018

During AHM syncing activity my vertica nodes went down several times and I started the nodes manually. Can you explain why this happens? I used the steps explained here: https://www.orahow.com/2018/06/fix-ahm-not-advancing-to-last-epoch.html

Jim_Knicely · June 2018

Hi,

Are you asking why your nodes went down? I'm guessing a network issue? Have you looked at the vertica.log for some indication of a problem?

Check the "Spread Debugging" Knowledge Base article here:

https://my.vertica.com/kb/Spread-Debugging/Content/BestPractices/Spread-Debugging.htm

Note: If Spread on one node stops communicating with the other Spreads in the cluster, the Spread daemon removes that node from the cluster membership. The Spread daemon waits for 8 seconds before removing nodes from the membership.

icyvivek · June 2018

Hi Jim,

Is this interval (8 seconds) configurable? I am considering a scenario where there is a network disconnect for, let's say, 15-20 seconds. If we know this beforehand, we can increase this value to avoid the node outage.

Jim_Knicely · June 2018

In a non-Vertica deployment of Spread you can modify the token timeout.

See: http://www.spread.org/docs/guide/users_guide.pdf

However, in a Vertica deployment, we do not support any change to the Spread timeout settings, as this would require a custom build of Spread, would likely have unforeseen consequences, and could destabilize the system.

FYI, here is a link to a great read titled "Spread Configuration Best Practices":
https://my.vertica.com/kb/SpreadConfigurationBestPractices/Content/SpreadConfigurationBestPractices.htm
