Recovery options after cluster failure
We have a 3-node cluster that is currently not K-safe. Two nodes crashed and the third shut down. On restarting into recovery, the LGE (last good epoch) is about 24 hours behind the last update, and the cluster is attempting to roll back. I am not sure why it was lagging so far behind. Recovery has been rolling back for 4 days, and I'm not sure an end is in sight. It is doing something, as CPU and disk are both bursting on the node that is too far ahead, but it's taking too long.
What options do we have here?
Is it possible to just start the cluster in an inconsistent state? This isn't a live cluster, and we just want to pull as much data as possible off it before retiring it. It doesn't matter too much if the data is patchy or inconsistent.
Is there a way to see the progress of a recovery while it is running? At the moment I don't know if there's a day left, or a year.