Is it safe to use -F option on start_db of admintools?
In Vertica 9.2, I use 'admintools -t start_db -d defaultdb -p defaultdb". I use Ansible playbook for starting cluster / database
However, my database cluster (of 3 nodes) randomly reports failure to start database and I found error message in admintools.log
{'panicmsg': 'Startup Failed, ASR Required'}
I tried admintools UI to rollback to the last good epoch and it works, but its manual step. I'm trying to update my startup script of "admintools -t start_db -F" since I noticed an option "--force" on start_db with the following description -
-F, --force force the database to start at an epoch before data
consistency problems were detected.
I want to know is it safe to call "--force" option always in the script that starts vertica database? Would this option attempts to always use last good epoch (LGE) or does it only if required?
Thanks
Best Answer
-
Btw I think what's really desired here is using "-t restart_db -e last" which asks for a restart using the Last Good Epoch. Nothing additional needed on the command line. But, I don't know what would happen in the extreme cases where LGE might be 0 due to disk issues etc - would need to look at the admintools code (which I haven't done).
Some more background on how LGE is determined, which related to the -t return_epoch usage. During a normal shutdown, every node writes out the LGE to a file. On a subsequent startup, that file is read by admintools and used as the startup epoch (the value possibly negotiated among the nodes if the values differ).
But in the case of an abnormal shutdown - either processes were killed or they exited after an exception - that LGE was not written out to the file. So on the next startup attempt, all nodes will write out the LGE and then shutdown. It then takes another startup attempt asking vertica to start from a specific epoch.
So in Jim's example above, "-t return_epoch" was simply reporting that there's no LGE to return since the nodes exited abnormally.
5
Answers
If ASR is required the -F likely won't help.
Example:
But you can restart the DB at the LGE:
I think the bigger issue is why you need ASR? Is the DB crashing or not being shut down cleanly?
@Jim_Knicely, thanks for the response and clarification. I believe it happened because the DB was not shutdown cleanly.
Am I right in saying that it is legitimate to automate these steps when it fails with "ASR Required"?
As long as you are okay with loosing any DB changes after the LGE, I think it's fine to automate the restart.
Note, you can combine the two calls to admintools like this:
Just replace test_db with your database name.
@Jim_Knicely, sounds good, thanks. One last question -
Does 'return_epoch' option returns current epoch in case of normal situation where database was shutdown properly? Asking this to make sure whether it is possible to combine those 2 calls (return_epoch and restart_db) always which should work in both cases - graceful db shutdown and unclean db shutdown
Yup!
@hrao -
The return_epoch tool of admintools shows the "Last Good Epoch", not the "Current Epoch". The restart_db too will always try to do ASR (Admin Specified Recovery) to the LGE.
Example on a 3 node cluster:
So, return_epoch value at least 1 epoch less than current value. In case of graceful shutdown, we should not use the return value of 'return_epoh' as input to restart_db but it can be used for ASR Required scenario only. Am I right? If so, we can't write a common playbook that covers both graceful shutdown of the database and failure of database for some unknown reason.
When it reports ASR Required, start_db command of admintools hangs without returning. I see timeout option but it is hard to guess a value that should be specified, so don't want to use it.
After a clean shut down the LGE will advance and your data will be safe on disk.
Example:
So although Vertica does a RECOVERY step because I requested an ASR, it is not needed here, but I don't think it does anything to cause you to lose committed data. Although, on a large cluster it may increase the time needed to start as Vertica is checking if Recovery is needed on each node.
Another example...
If you automate this, I wouldn't do it in a single combined step. Do it in two separate steps and verify that the epoch returned is not something bogus like 0, which would rollback the database to any empty database. I've seen this situation when something is wrong with the environment, such as disks being inaccessible. Better safe than sorry.
It can also happen - if you aren't monitoring LGE - that LGE can get stuck behind because of data stuck in the WOS that is unable to moveout. This is another case where you could rollback weeks or months of data. In this case if there's a single table or projection that is holding back LGE, the technical support team can help to disable that single table or projection to allow the database to come up without rolling back.
Yeah! Heed Sharons' warnings! Especially about the LGE lagging! I of course was making the assumption the LGE was moving along as expected
See:
https://www.vertica.com/blog/ancient-history-mark-not-advancing/
https://www.vertica.com/kb/Node-Recovery-in-Vertica/Content/BestPractices/Node-Recovery-in-Vertica.htm
Thanks @Sharon_Cutter for the insights and caution. Looks like manual intervention is better when ASR Required is reported.
@Jim_Knicely, since getting correct LGE values is not guaranteed from 'return_epoch, and depending it (even filtering Zero) to roll back to non-zero LGE value has a potential to loose some data. Is my understanding correct?
That's why it's called ASR - Admin Specified Recovery, where you are the Admin
Here is an example where LGE did not advance and DB goes down:
Insert some data into using WOS:
Kill the DB:
Uh oh! admintools cannot report the LGE! At this point you should involve support to minimize any data loss!
In my example, I don't care about data loss, so I can get admintools to find an LGE like this:
But, I lost my commits on the inserts into the JIM table because the committed data was in the WOS.