Is it safe to use -F option on start_db of admintools?

hraohrao Registered User

In Vertica 9.2, I use 'admintools -t start_db -d defaultdb -p defaultdb". I use Ansible playbook for starting cluster / database

However, my database cluster (of 3 nodes) randomly reports failure to start database and I found error message in admintools.log
{'panicmsg': 'Startup Failed, ASR Required'}

I tried admintools UI to rollback to the last good epoch and it works, but its manual step. I'm trying to update my startup script of "admintools -t start_db -F" since I noticed an option "--force" on start_db with the following description -
-F, --force force the database to start at an epoch before data
consistency problems were detected.

I want to know is it safe to call "--force" option always in the script that starts vertica database? Would this option attempts to always use last good epoch (LGE) or does it only if required?
Thanks

Best Answer

Answers

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    If ASR is required the -F likely won't help.

    Example:

    [[email protected] ~]$ vsql -c "SELECT node_name FROM nodes;"
         node_name
    --------------------
     v_test_db_node0001
    (1 row)
    
    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch, refresh_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch | refresh_epoch
    ---------------+-----------+-----------------+---------------
               228 |       226 |             226 |            -1
    (1 row)
    
    [[email protected] ~]$ admintools -t kill_node -s v_test_db_node0001
    *** Terminating vertica and performing host cleanup ***
            Terminating vertica processes on host '74.208.100.58'
    All signals sent successfully.
    
    [[email protected]8384357 ~]$ admintools -t start_db -d test_db -F
    Info: no password specified, using none
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
    Found these errors in startup.logs on hosts:
    host ['74.208.100.58'] report: Startup Failed, ASR Required
    Press RETURN to continue
    Database test_db did not start successfully: Something Failed
    

    But you can restart the DB at the LGE:

    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|226
    
    [[email protected] ~]$ admintools -t restart_db -d test_db -e 226
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-05 23:59:02.136161-05' epoch 226 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    

    I think the bigger issue is why you need ASR? Is the DB crashing or not being shut down cleanly?

  • hraohrao Registered User
    edited March 6

    @Jim_Knicely, thanks for the response and clarification. I believe it happened because the DB was not shutdown cleanly.
    Am I right in saying that it is legitimate to automate these steps when it fails with "ASR Required"?

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    As long as you are okay with loosing any DB changes after the LGE, I think it's fine to automate the restart.

    Note, you can combine the two calls to admintools like this:

    admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    

    Just replace test_db with your database name.

  • hraohrao Registered User

    @Jim_Knicely, sounds good, thanks. One last question -
    Does 'return_epoch' option returns current epoch in case of normal situation where database was shutdown properly? Asking this to make sure whether it is possible to combine those 2 calls (return_epoch and restart_db) always which should work in both cases - graceful db shutdown and unclean db shutdown

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Yup!

    [[email protected] ~]$ admintools -t stop_db -d test_db -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database test_db stopped successfully
    
    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|238
    
    [[email protected] ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-06 23:59:03.433907-05' epoch 238 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    
  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    @hrao -

    The return_epoch tool of admintools shows the "Last Good Epoch", not the "Current Epoch". The restart_db too will always try to do ASR (Admin Specified Recovery) to the LGE.

    Example on a 3 node cluster:

    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
    Timing is on.
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4030 |      4029 |            4029
    (1 row)
    
    [[email protected] ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4029
    
    [[email protected] ~]$ admintools -t stop_db -d vmart -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database vmart stopped successfully
    
    [[email protected] ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4029
    
    [[email protected] ~]$ admintools -t restart_db -d vmart -e `admintools -t return_epoch -d vmart | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database vmart at '2019-03-06 16:08:12.518078-05' epoch 4029 ***
            Starting nodes:
                    v_vmart_node0001 (192.168.61.227)
                    v_vmart_node0002 (192.168.61.228)
                    v_vmart_node0003 (192.168.61.229)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
    Nodes in TRANSITIONAL state: 192.168.61.228, 192.168.61.229, 192.168.61.227
    Nodes DOWN: v_vmart_node0001, v_vmart_node0002, v_vmart_node0003 (may be still initializing).
    It is suggested that you continue waiting.
    Do you want to continue waiting? (yes/no) [yes] yes
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (UP) v_vmart_node0002: (UP) v_vmart_node0003: (UP)
    Database startup successful
    
    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
    Timing is on.
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4031 |      4030 |            4030
    (1 row)
    
  • hraohrao Registered User

    So, return_epoch value at least 1 epoch less than current value. In case of graceful shutdown, we should not use the return value of 'return_epoh' as input to restart_db but it can be used for ASR Required scenario only. Am I right? If so, we can't write a common playbook that covers both graceful shutdown of the database and failure of database for some unknown reason.

    When it reports ASR Required, start_db command of admintools hangs without returning. I see timeout option but it is hard to guess a value that should be specified, so don't want to use it.

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited March 7

    After a clean shut down the LGE will advance and your data will be safe on disk.

    Example:

    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               240 |       239 |             239
    (1 row)
    
    [[email protected] ~]$ vsql -c "ALTER DATABASE test_db SET HistoryRetentionTime = 86400;"
    ALTER DATABASE
    
    [[email protected] ~]$ vsql -c "CREATE TABLE lge (c INT); INSERT INTO lge SELECT 1; COMMIT;"
    COMMIT
    
    [[email protected] ~]$ vsql -c "INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT;"
    COMMIT
    
    [[email protected] ~]$ vsql -c "SELECT * FROM lge;"
     c
    ---
     1
     1
     1
    (3 rows)
    
    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               243 |       239 |             239
    (1 row)
    
    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|239
    
    [[email protected] ~]$ admintools -t stop_db -d test_db -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database test_db stopped successfully
    
    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|242
    
    [[email protected] ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-07 10:33:24.718447-05' epoch 242 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    
    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               244 |       239 |             243
    (1 row)
    
    [[email protected] ~]$ vsql -c "SELECT * FROM lge;"
     c
    ---
     1
     1
     1
    (3 rows)
    

    So although Vertica does a RECOVERY step because I requested an ASR, it is not needed here, but I don't think it does anything to cause you to lose committed data. Although, on a large cluster it may increase the time needed to start as Vertica is checking if Recovery is needed on each node.

    Another example...

    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               259 |       239 |             257
    (1 row)
    
    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|257
    
    [[email protected] ~]$ vsql -c "INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT;"
    COMMIT
    
    [[email protected] ~]$ vsql -c "DELETE FROM lge; COMMIT;"
    COMMIT
    
    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               263 |       239 |             257
    (1 row)
    
    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|257
    
    [[email protected] ~]$ admintools -t stop_db -d test_db -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database test_db stopped successfully
    
    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|262
    
    [[email protected] ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-07 11:06:03.000156-05' epoch 262 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    
    [[email protected] ~]$ admintools -t return_epoch -d test_db
    LastEpoch|263
    [[email protected] ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               264 |       239 |             263
    (1 row)
    
    [[email protected] ~]$ vsql -c "SELECT * FROM lge;"
     c
    ---
    (0 rows)
    
  • Sharon_CutterSharon_Cutter Registered User

    If you automate this, I wouldn't do it in a single combined step. Do it in two separate steps and verify that the epoch returned is not something bogus like 0, which would rollback the database to any empty database. I've seen this situation when something is wrong with the environment, such as disks being inaccessible. Better safe than sorry.

    It can also happen - if you aren't monitoring LGE - that LGE can get stuck behind because of data stuck in the WOS that is unable to moveout. This is another case where you could rollback weeks or months of data. In this case if there's a single table or projection that is holding back LGE, the technical support team can help to disable that single table or projection to allow the database to come up without rolling back.

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited March 8

    Yeah! Heed Sharons' warnings! Especially about the LGE lagging! I of course was making the assumption the LGE was moving along as expected :)

    See:
    https://www.vertica.com/blog/ancient-history-mark-not-advancing/
    https://www.vertica.com/kb/Node-Recovery-in-Vertica/Content/BestPractices/Node-Recovery-in-Vertica.htm

  • hraohrao Registered User

    Thanks @Sharon_Cutter for the insights and caution. Looks like manual intervention is better when ASR Required is reported.
    @Jim_Knicely, since getting correct LGE values is not guaranteed from 'return_epoch, and depending it (even filtering Zero) to roll back to non-zero LGE value has a potential to loose some data. Is my understanding correct?

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited March 9

    That's why it's called ASR - Admin Specified Recovery, where you are the Admin :)

    Here is an example where LGE did not advance and DB goes down:

    dbadmin->[email protected]=> CREATE TABLE jim (c INT);
    CREATE TABLE
    
    dbadmin->[email protected]=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4094 |      4092 |            4093
    (1 row)
    

    Insert some data into using WOS:

    dbadmin->[email protected]=>* INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT;
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
    dbadmin->[email protected]=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4099 |      4092 |            4093
    (1 row)
    
    dbadmin->[email protected]=>* \q
    
    [[email protected] ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4093
    

    Kill the DB:

    [[email protected] ~]$ admintools -t kill_node -s 192.168.61.227,192.168.61.228,192.168.61.229
    *** Terminating vertica and performing host cleanup ***
            Terminating vertica processes on host '192.168.61.227'
            Terminating vertica processes on host '192.168.61.228'
            Terminating vertica processes on host '192.168.61.229'
    All signals sent successfully.
    
    [[email protected] ~]$ admintools -t return_epoch -d vmart
    Too few nodes are available to restart cluster
    LastEpoch|
    

    Uh oh! admintools cannot report the LGE! At this point you should involve support to minimize any data loss!

    In my example, I don't care about data loss, so I can get admintools to find an LGE like this:

    [[email protected] ~]$ admintools -t start_db -d vmart -F
    Info: no password specified, using none
            Starting nodes:
                    v_vmart_node0001 (192.168.61.227)
                    v_vmart_node0002 (192.168.61.228)
                    v_vmart_node0003 (192.168.61.229)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (INITIALIZING) v_vmart_node0002: (INITIALIZING) v_vmart_node0003: (INITIALIZING)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
    Found these errors in startup.logs on hosts:
    hosts ['192.168.61.228', '192.168.61.229', '192.168.61.227'] report: Startup Failed, ASR Required
    Press RETURN to continue
    Database vmart did not start successfully: Something Failed
    
    [[email protected] ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4093
    
    [[email protected] ~]$ admintools -t restart_db -d vmart -e `admintools -t return_epoch -d vmart | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database vmart at '2019-03-09 08:23:32.571289-05' epoch 4093 ***
            Starting nodes:
                    v_vmart_node0001 (192.168.61.227)
                    v_vmart_node0002 (192.168.61.228)
                    v_vmart_node0003 (192.168.61.229)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (UP) v_vmart_node0002: (UP) v_vmart_node0003: (UP)
    Database startup successful
    

    But, I lost my commits on the inserts into the JIM table because the committed data was in the WOS.

    [[email protected] ~]$ vsql -c "SELECT * FROM jim;"
     c
    ---
    (0 rows)
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file