Is it safe to use -F option on start_db of admintools?
In Vertica 9.2, I use 'admintools -t start_db -d defaultdb -p defaultdb". I use Ansible playbook for starting cluster / database
However, my database cluster (of 3 nodes) randomly reports failure to start database and I found error message in admintools.log
{'panicmsg': 'Startup Failed, ASR Required'}
I tried admintools UI to rollback to the last good epoch and it works, but its manual step. I'm trying to update my startup script of "admintools -t start_db -F" since I noticed an option "--force" on start_db with the following description -
-F, --force force the database to start at an epoch before data
consistency problems were detected.
I want to know is it safe to call "--force" option always in the script that starts vertica database? Would this option attempts to always use last good epoch (LGE) or does it only if required?
Thanks
Best Answer
-
Btw I think what's really desired here is using "-t restart_db -e last" which asks for a restart using the Last Good Epoch. Nothing additional needed on the command line. But, I don't know what would happen in the extreme cases where LGE might be 0 due to disk issues etc - would need to look at the admintools code (which I haven't done).
Some more background on how LGE is determined, which related to the -t return_epoch usage. During a normal shutdown, every node writes out the LGE to a file. On a subsequent startup, that file is read by admintools and used as the startup epoch (the value possibly negotiated among the nodes if the values differ).
But in the case of an abnormal shutdown - either processes were killed or they exited after an exception - that LGE was not written out to the file. So on the next startup attempt, all nodes will write out the LGE and then shutdown. It then takes another startup attempt asking vertica to start from a specific epoch.
So in Jim's example above, "-t return_epoch" was simply reporting that there's no LGE to return since the nodes exited abnormally.
5

Answers
If ASR is required the -F likely won't help.
Example:
[dbadmin@s18384357 ~]$ vsql -c "SELECT node_name FROM nodes;" node_name -------------------- v_test_db_node0001 (1 row) [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch, refresh_epoch FROM system;" current_epoch | ahm_epoch | last_good_epoch | refresh_epoch ---------------+-----------+-----------------+--------------- 228 | 226 | 226 | -1 (1 row) [dbadmin@s18384357 ~]$ admintools -t kill_node -s v_test_db_node0001 *** Terminating vertica and performing host cleanup *** Terminating vertica processes on host '74.208.100.58' All signals sent successfully. [dbadmin@s18384357 ~]$ admintools -t start_db -d test_db -F Info: no password specified, using none Starting nodes: v_test_db_node0001 (74.208.100.58) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Found these errors in startup.logs on hosts: host ['74.208.100.58'] report: Startup Failed, ASR Required Press RETURN to continue Database test_db did not start successfully: Something FailedBut you can restart the DB at the LGE:
[dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|226 [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e 226 Info: no password specified, using none *** Restarting database test_db at '2019-03-05 23:59:02.136161-05' epoch 226 *** Starting nodes: v_test_db_node0001 (74.208.100.58) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (RECOVERING) Node Status: v_test_db_node0001: (UP) Database startup successfulI think the bigger issue is why you need ASR? Is the DB crashing or not being shut down cleanly?
@Jim_Knicely, thanks for the response and clarification. I believe it happened because the DB was not shutdown cleanly.
Am I right in saying that it is legitimate to automate these steps when it fails with "ASR Required"?
As long as you are okay with loosing any DB changes after the LGE, I think it's fine to automate the restart.
Note, you can combine the two calls to admintools like this:
Just replace test_db with your database name.
@Jim_Knicely, sounds good, thanks. One last question -
Does 'return_epoch' option returns current epoch in case of normal situation where database was shutdown properly? Asking this to make sure whether it is possible to combine those 2 calls (return_epoch and restart_db) always which should work in both cases - graceful db shutdown and unclean db shutdown
Yup!
[dbadmin@s18384357 ~]$ admintools -t stop_db -d test_db -F Info: no password specified, using none Connecting to database Issuing shutdown command to database Database test_db stopped successfully [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|238 [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'` Info: no password specified, using none *** Restarting database test_db at '2019-03-06 23:59:03.433907-05' epoch 238 *** Starting nodes: v_test_db_node0001 (74.208.100.58) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (RECOVERING) Node Status: v_test_db_node0001: (UP) Database startup successful@hrao -
The return_epoch tool of admintools shows the "Last Good Epoch", not the "Current Epoch". The restart_db too will always try to do ASR (Admin Specified Recovery) to the LGE.
Example on a 3 node cluster:
[dbadmin@SE-Sandbox-43-node1 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" Timing is on. current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 4030 | 4029 | 4029 (1 row) [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart LastEpoch|4029 [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t stop_db -d vmart -F Info: no password specified, using none Connecting to database Issuing shutdown command to database Database vmart stopped successfully [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart LastEpoch|4029 [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t restart_db -d vmart -e `admintools -t return_epoch -d vmart | sed 's/LastEpoch|//'` Info: no password specified, using none *** Restarting database vmart at '2019-03-06 16:08:12.518078-05' epoch 4029 *** Starting nodes: v_vmart_node0001 (192.168.61.227) v_vmart_node0002 (192.168.61.228) v_vmart_node0003 (192.168.61.229) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Nodes in TRANSITIONAL state: 192.168.61.228, 192.168.61.229, 192.168.61.227 Nodes DOWN: v_vmart_node0001, v_vmart_node0002, v_vmart_node0003 (may be still initializing). It is suggested that you continue waiting. Do you want to continue waiting? (yes/no) [yes] yes Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING) Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING) Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING) Node Status: v_vmart_node0001: (UP) v_vmart_node0002: (UP) v_vmart_node0003: (UP) Database startup successful [dbadmin@SE-Sandbox-43-node1 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" Timing is on. current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 4031 | 4030 | 4030 (1 row)So, return_epoch value at least 1 epoch less than current value. In case of graceful shutdown, we should not use the return value of 'return_epoh' as input to restart_db but it can be used for ASR Required scenario only. Am I right? If so, we can't write a common playbook that covers both graceful shutdown of the database and failure of database for some unknown reason.
When it reports ASR Required, start_db command of admintools hangs without returning. I see timeout option but it is hard to guess a value that should be specified, so don't want to use it.
After a clean shut down the LGE will advance and your data will be safe on disk.
Example:
[dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 240 | 239 | 239 (1 row) [dbadmin@s18384357 ~]$ vsql -c "ALTER DATABASE test_db SET HistoryRetentionTime = 86400;" ALTER DATABASE [dbadmin@s18384357 ~]$ vsql -c "CREATE TABLE lge (c INT); INSERT INTO lge SELECT 1; COMMIT;" COMMIT [dbadmin@s18384357 ~]$ vsql -c "INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT;" COMMIT [dbadmin@s18384357 ~]$ vsql -c "SELECT * FROM lge;" c --- 1 1 1 (3 rows) [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 243 | 239 | 239 (1 row) [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|239 [dbadmin@s18384357 ~]$ admintools -t stop_db -d test_db -F Info: no password specified, using none Connecting to database Issuing shutdown command to database Database test_db stopped successfully [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|242 [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'` Info: no password specified, using none *** Restarting database test_db at '2019-03-07 10:33:24.718447-05' epoch 242 *** Starting nodes: v_test_db_node0001 (74.208.100.58) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (RECOVERING) Node Status: v_test_db_node0001: (UP) Database startup successful [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 244 | 239 | 243 (1 row) [dbadmin@s18384357 ~]$ vsql -c "SELECT * FROM lge;" c --- 1 1 1 (3 rows)So although Vertica does a RECOVERY step because I requested an ASR, it is not needed here, but I don't think it does anything to cause you to lose committed data. Although, on a large cluster it may increase the time needed to start as Vertica is checking if Recovery is needed on each node.
Another example...
[dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 259 | 239 | 257 (1 row) [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|257 [dbadmin@s18384357 ~]$ vsql -c "INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT;" COMMIT [dbadmin@s18384357 ~]$ vsql -c "DELETE FROM lge; COMMIT;" COMMIT [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 263 | 239 | 257 (1 row) [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|257 [dbadmin@s18384357 ~]$ admintools -t stop_db -d test_db -F Info: no password specified, using none Connecting to database Issuing shutdown command to database Database test_db stopped successfully [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|262 [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'` Info: no password specified, using none *** Restarting database test_db at '2019-03-07 11:06:03.000156-05' epoch 262 *** Starting nodes: v_test_db_node0001 (74.208.100.58) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (DOWN) Node Status: v_test_db_node0001: (RECOVERING) Node Status: v_test_db_node0001: (UP) Database startup successful [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db LastEpoch|263 [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;" current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 264 | 239 | 263 (1 row) [dbadmin@s18384357 ~]$ vsql -c "SELECT * FROM lge;" c --- (0 rows)If you automate this, I wouldn't do it in a single combined step. Do it in two separate steps and verify that the epoch returned is not something bogus like 0, which would rollback the database to any empty database. I've seen this situation when something is wrong with the environment, such as disks being inaccessible. Better safe than sorry.
It can also happen - if you aren't monitoring LGE - that LGE can get stuck behind because of data stuck in the WOS that is unable to moveout. This is another case where you could rollback weeks or months of data. In this case if there's a single table or projection that is holding back LGE, the technical support team can help to disable that single table or projection to allow the database to come up without rolling back.
Yeah! Heed Sharons' warnings! Especially about the LGE lagging! I of course was making the assumption the LGE was moving along as expected
See:
https://www.vertica.com/blog/ancient-history-mark-not-advancing/
https://www.vertica.com/kb/Node-Recovery-in-Vertica/Content/BestPractices/Node-Recovery-in-Vertica.htm
Thanks @Sharon_Cutter for the insights and caution. Looks like manual intervention is better when ASR Required is reported.
@Jim_Knicely, since getting correct LGE values is not guaranteed from 'return_epoch, and depending it (even filtering Zero) to roll back to non-zero LGE value has a potential to loose some data. Is my understanding correct?
That's why it's called ASR - Admin Specified Recovery, where you are the Admin
Here is an example where LGE did not advance and DB goes down:
dbadmin->vmart@sandbox1=> CREATE TABLE jim (c INT); CREATE TABLE dbadmin->vmart@sandbox1=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM system; current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 4094 | 4092 | 4093 (1 row)Insert some data into using WOS:
dbadmin->vmart@sandbox1=>* INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; OUTPUT -------- 1 (1 row) COMMIT OUTPUT -------- 1 (1 row) COMMIT OUTPUT -------- 1 (1 row) COMMIT OUTPUT -------- 1 (1 row) COMMIT OUTPUT -------- 1 (1 row) COMMIT dbadmin->vmart@sandbox1=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM system; current_epoch | ahm_epoch | last_good_epoch ---------------+-----------+----------------- 4099 | 4092 | 4093 (1 row) dbadmin->vmart@sandbox1=>* \q [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart LastEpoch|4093Kill the DB:
[dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t kill_node -s 192.168.61.227,192.168.61.228,192.168.61.229 *** Terminating vertica and performing host cleanup *** Terminating vertica processes on host '192.168.61.227' Terminating vertica processes on host '192.168.61.228' Terminating vertica processes on host '192.168.61.229' All signals sent successfully. [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart Too few nodes are available to restart cluster LastEpoch|Uh oh! admintools cannot report the LGE! At this point you should involve support to minimize any data loss!
In my example, I don't care about data loss, so I can get admintools to find an LGE like this:
[dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t start_db -d vmart -F Info: no password specified, using none Starting nodes: v_vmart_node0001 (192.168.61.227) v_vmart_node0002 (192.168.61.228) v_vmart_node0003 (192.168.61.229) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (INITIALIZING) v_vmart_node0002: (INITIALIZING) v_vmart_node0003: (INITIALIZING) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Found these errors in startup.logs on hosts: hosts ['192.168.61.228', '192.168.61.229', '192.168.61.227'] report: Startup Failed, ASR Required Press RETURN to continue Database vmart did not start successfully: Something Failed [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart LastEpoch|4093 [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t restart_db -d vmart -e `admintools -t return_epoch -d vmart | sed 's/LastEpoch|//'` Info: no password specified, using none *** Restarting database vmart at '2019-03-09 08:23:32.571289-05' epoch 4093 *** Starting nodes: v_vmart_node0001 (192.168.61.227) v_vmart_node0002 (192.168.61.228) v_vmart_node0003 (192.168.61.229) Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize. Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN) Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING) Node Status: v_vmart_node0001: (UP) v_vmart_node0002: (UP) v_vmart_node0003: (UP) Database startup successfulBut, I lost my commits on the inserts into the JIM table because the committed data was in the WOS.