
Recovery options after cluster failure

 

We have a 3-node cluster that is currently not K-safe. Two nodes crashed and the third shut down. On restarting into recovery, the LGE (Last Good Epoch) is about 24 hours behind the last update, and the cluster attempts to roll back to it. I am not sure why the LGE was lagging so far behind. Recovery has been rolling back for 4 days and I'm not sure an end is in sight. It is doing something, as CPU and disk are both bursting on the node that is too far ahead, but it's taking too long.

 

What options do we have here?

 

Is it possible to just start the cluster in an inconsistent state? This isn't a live cluster, and we just want to pull as much data as possible off it before retiring it. It doesn't matter too much if the data is patchy or inconsistent.

 

Is there a way to see recovery progress while the cluster is recovering? At the moment I don't know if there's a day left or a year.

Comments


    Hi Andy

     

    I completely understand your scenario, and I wish I could have helped sooner.

     

    You asked, "What options do we have here?"

     


     

    1. What kind of recovery are you performing on each node? Check whether recovery by table is enabled:

     

    select recoverbytable from vs_global_settings;

    If it returns true, stop the recovery and disable recovery by table (available in 7.2 and higher) using:

    select set_recover_by_table('false');

     

    If the recovery does not seem to be progressing, stop the node that is being recovered; you can simply stop the Vertica process running on that node.

    More precisely, do the following:

    1. Stop any scripts that might be running DDL changes.
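
    A minimal sketch for finding, and if necessary closing, sessions that are still executing statements, assuming you can still connect to a node that is UP; the session_id below is a placeholder to replace with a value returned by the first query:

    -- list sessions that are currently executing a statement
    select session_id, user_name, current_statement from sessions where current_statement is not null;
    -- optionally close an offending session
    select close_session('<session_id>');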

     

    2. Monitor the recovery:

     

    -- per-node recovery progress; current_completed vs. current_total gives a rough fraction for the current phase
    select node_name, recover_epoch, recovery_phase, current_completed, current_total, is_running from recovery_status where is_running = 't';

    -- per-projection recovery progress
    select node_name, projection_name, method, status, progress, detail, start_time from projection_recoveries where status='running';

     

    3. If you think the epoch is way behind, check for data corruption, missing data, or projection issues; you can create replacement projections and drop the corrupted ones, for example as sketched below.
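
    A minimal sketch of that projection swap, using placeholder names (public.my_table, my_table_super, my_table_p2) that you would replace with your own table and projection names:

    -- create a replacement superprojection for the affected table
    create projection public.my_table_p2 as select * from public.my_table;
    -- populate the new projection from the existing data
    select refresh('public.my_table');
    -- advance the AHM so the old, out-of-date projection can be dropped
    select make_ahm_now();
    -- drop the suspect projection once the new one is up to date
    drop projection public.my_table_super;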

     

    4. Run:

    SELECT * FROM projection_recoveries WHERE status ILIKE '%running%';

     

    5. Run:
    select do_tm_task('moveout');   -- flush any in-memory (WOS) data to disk
    select make_ahm_now(true);      -- advance the AHM even though nodes are down

     

    If the problem persists:

     

    6. SSH to the server hosting the node that needs to be recovered, then:

    empty the Vertica data directory, for example:

    /home/dbadmin/test/v_test_node0001_data

    empty the Vertica Catalog directory (NOT the catalog root directory that contains vertica.log, but the Catalog subdirectory inside the catalog root), for example:

    /home/dbadmin/TEST/v_test_node0001_catalog/Catalog
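
    Before emptying anything, it is worth confirming which catalog directory belongs to that node; a quick sketch, assuming you can still connect to a node that is UP:

    -- catalog location and state of every node in the cluster
    select node_name, node_state, catalog_path from nodes;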

    7. Now start the node from admintools

     

    Hope this helps.

     

    Rgds

    Chris
