Is it safe to use -F option on start_db of admintools?

In Vertica 9.2, I use 'admintools -t start_db -d defaultdb -p defaultdb". I use Ansible playbook for starting cluster / database

However, my database cluster (of 3 nodes) randomly reports failure to start database and I found error message in admintools.log
{'panicmsg': 'Startup Failed, ASR Required'}

I tried admintools UI to rollback to the last good epoch and it works, but its manual step. I'm trying to update my startup script of "admintools -t start_db -F" since I noticed an option "--force" on start_db with the following description -
-F, --force force the database to start at an epoch before data
consistency problems were detected.

I want to know is it safe to call "--force" option always in the script that starts vertica database? Would this option attempts to always use last good epoch (LGE) or does it only if required?
Thanks

Best Answer

Answers

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    If ASR is required the -F likely won't help.

    Example:

    [dbadmin@s18384357 ~]$ vsql -c "SELECT node_name FROM nodes;"
         node_name
    --------------------
     v_test_db_node0001
    (1 row)
    
    [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch, refresh_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch | refresh_epoch
    ---------------+-----------+-----------------+---------------
               228 |       226 |             226 |            -1
    (1 row)
    
    [dbadmin@s18384357 ~]$ admintools -t kill_node -s v_test_db_node0001
    *** Terminating vertica and performing host cleanup ***
            Terminating vertica processes on host '74.208.100.58'
    All signals sent successfully.
    
    [dbadmin@s18384357 ~]$ admintools -t start_db -d test_db -F
    Info: no password specified, using none
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
    Found these errors in startup.logs on hosts:
    host ['74.208.100.58'] report: Startup Failed, ASR Required
    Press RETURN to continue
    Database test_db did not start successfully: Something Failed
    

    But you can restart the DB at the LGE:

    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|226
    
    [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e 226
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-05 23:59:02.136161-05' epoch 226 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    

    I think the bigger issue is why you need ASR? Is the DB crashing or not being shut down cleanly?

  • hraohrao
    edited March 2019

    @Jim_Knicely, thanks for the response and clarification. I believe it happened because the DB was not shutdown cleanly.
    Am I right in saying that it is legitimate to automate these steps when it fails with "ASR Required"?

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    As long as you are okay with loosing any DB changes after the LGE, I think it's fine to automate the restart.

    Note, you can combine the two calls to admintools like this:

    admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    

    Just replace test_db with your database name.

  • @Jim_Knicely, sounds good, thanks. One last question -
    Does 'return_epoch' option returns current epoch in case of normal situation where database was shutdown properly? Asking this to make sure whether it is possible to combine those 2 calls (return_epoch and restart_db) always which should work in both cases - graceful db shutdown and unclean db shutdown

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Yup!

    [dbadmin@s18384357 ~]$ admintools -t stop_db -d test_db -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database test_db stopped successfully
    
    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|238
    
    [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-06 23:59:03.433907-05' epoch 238 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    
  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    @hrao -

    The return_epoch tool of admintools shows the "Last Good Epoch", not the "Current Epoch". The restart_db too will always try to do ASR (Admin Specified Recovery) to the LGE.

    Example on a 3 node cluster:

    [dbadmin@SE-Sandbox-43-node1 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
    Timing is on.
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4030 |      4029 |            4029
    (1 row)
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4029
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t stop_db -d vmart -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database vmart stopped successfully
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4029
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t restart_db -d vmart -e `admintools -t return_epoch -d vmart | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database vmart at '2019-03-06 16:08:12.518078-05' epoch 4029 ***
            Starting nodes:
                    v_vmart_node0001 (192.168.61.227)
                    v_vmart_node0002 (192.168.61.228)
                    v_vmart_node0003 (192.168.61.229)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
    Nodes in TRANSITIONAL state: 192.168.61.228, 192.168.61.229, 192.168.61.227
    Nodes DOWN: v_vmart_node0001, v_vmart_node0002, v_vmart_node0003 (may be still initializing).
    It is suggested that you continue waiting.
    Do you want to continue waiting? (yes/no) [yes] yes
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (UP) v_vmart_node0002: (UP) v_vmart_node0003: (UP)
    Database startup successful
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
    Timing is on.
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4031 |      4030 |            4030
    (1 row)
    
  • So, return_epoch value at least 1 epoch less than current value. In case of graceful shutdown, we should not use the return value of 'return_epoh' as input to restart_db but it can be used for ASR Required scenario only. Am I right? If so, we can't write a common playbook that covers both graceful shutdown of the database and failure of database for some unknown reason.

    When it reports ASR Required, start_db command of admintools hangs without returning. I see timeout option but it is hard to guess a value that should be specified, so don't want to use it.

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited March 2019

    After a clean shut down the LGE will advance and your data will be safe on disk.

    Example:

    [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               240 |       239 |             239
    (1 row)
    
    [dbadmin@s18384357 ~]$ vsql -c "ALTER DATABASE test_db SET HistoryRetentionTime = 86400;"
    ALTER DATABASE
    
    [dbadmin@s18384357 ~]$ vsql -c "CREATE TABLE lge (c INT); INSERT INTO lge SELECT 1; COMMIT;"
    COMMIT
    
    [dbadmin@s18384357 ~]$ vsql -c "INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT;"
    COMMIT
    
    [dbadmin@s18384357 ~]$ vsql -c "SELECT * FROM lge;"
     c
    ---
     1
     1
     1
    (3 rows)
    
    [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               243 |       239 |             239
    (1 row)
    
    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|239
    
    [dbadmin@s18384357 ~]$ admintools -t stop_db -d test_db -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database test_db stopped successfully
    
    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|242
    
    [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-07 10:33:24.718447-05' epoch 242 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    
    [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               244 |       239 |             243
    (1 row)
    
    [dbadmin@s18384357 ~]$ vsql -c "SELECT * FROM lge;"
     c
    ---
     1
     1
     1
    (3 rows)
    

    So although Vertica does a RECOVERY step because I requested an ASR, it is not needed here, but I don't think it does anything to cause you to lose committed data. Although, on a large cluster it may increase the time needed to start as Vertica is checking if Recovery is needed on each node.

    Another example...

    [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               259 |       239 |             257
    (1 row)
    
    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|257
    
    [dbadmin@s18384357 ~]$ vsql -c "INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT; INSERT INTO lge SELECT 1; COMMIT;"
    COMMIT
    
    [dbadmin@s18384357 ~]$ vsql -c "DELETE FROM lge; COMMIT;"
    COMMIT
    
    [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               263 |       239 |             257
    (1 row)
    
    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|257
    
    [dbadmin@s18384357 ~]$ admintools -t stop_db -d test_db -F
    Info: no password specified, using none
            Connecting to database
            Issuing shutdown command to database
    Database test_db stopped successfully
    
    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|262
    
    [dbadmin@s18384357 ~]$ admintools -t restart_db -d test_db -e `admintools -t return_epoch -d test_db | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database test_db at '2019-03-07 11:06:03.000156-05' epoch 262 ***
            Starting nodes:
                    v_test_db_node0001 (74.208.100.58)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (DOWN)
            Node Status: v_test_db_node0001: (RECOVERING)
            Node Status: v_test_db_node0001: (UP)
    Database startup successful
    
    [dbadmin@s18384357 ~]$ admintools -t return_epoch -d test_db
    LastEpoch|263
    [dbadmin@s18384357 ~]$ vsql -c "SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;"
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
               264 |       239 |             263
    (1 row)
    
    [dbadmin@s18384357 ~]$ vsql -c "SELECT * FROM lge;"
     c
    ---
    (0 rows)
    
  • If you automate this, I wouldn't do it in a single combined step. Do it in two separate steps and verify that the epoch returned is not something bogus like 0, which would rollback the database to any empty database. I've seen this situation when something is wrong with the environment, such as disks being inaccessible. Better safe than sorry.

    It can also happen - if you aren't monitoring LGE - that LGE can get stuck behind because of data stuck in the WOS that is unable to moveout. This is another case where you could rollback weeks or months of data. In this case if there's a single table or projection that is holding back LGE, the technical support team can help to disable that single table or projection to allow the database to come up without rolling back.

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited March 2019

    Yeah! Heed Sharons' warnings! Especially about the LGE lagging! I of course was making the assumption the LGE was moving along as expected :)

    See:
    https://www.vertica.com/blog/ancient-history-mark-not-advancing/
    https://www.vertica.com/kb/Node-Recovery-in-Vertica/Content/BestPractices/Node-Recovery-in-Vertica.htm

  • Thanks @Sharon_Cutter for the insights and caution. Looks like manual intervention is better when ASR Required is reported.
    @Jim_Knicely, since getting correct LGE values is not guaranteed from 'return_epoch, and depending it (even filtering Zero) to roll back to non-zero LGE value has a potential to loose some data. Is my understanding correct?

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited March 2019

    That's why it's called ASR - Admin Specified Recovery, where you are the Admin :)

    Here is an example where LGE did not advance and DB goes down:

    dbadmin->vmart@sandbox1=> CREATE TABLE jim (c INT);
    CREATE TABLE
    
    dbadmin->vmart@sandbox1=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4094 |      4092 |            4093
    (1 row)
    

    Insert some data into using WOS:

    dbadmin->vmart@sandbox1=>* INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT; INSERT INTO jim SELECT 1; COMMIT;
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
     OUTPUT
    --------
          1
    (1 row)
    
    COMMIT
    dbadmin->vmart@sandbox1=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;
     current_epoch | ahm_epoch | last_good_epoch
    ---------------+-----------+-----------------
              4099 |      4092 |            4093
    (1 row)
    
    dbadmin->vmart@sandbox1=>* \q
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4093
    

    Kill the DB:

    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t kill_node -s 192.168.61.227,192.168.61.228,192.168.61.229
    *** Terminating vertica and performing host cleanup ***
            Terminating vertica processes on host '192.168.61.227'
            Terminating vertica processes on host '192.168.61.228'
            Terminating vertica processes on host '192.168.61.229'
    All signals sent successfully.
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart
    Too few nodes are available to restart cluster
    LastEpoch|
    

    Uh oh! admintools cannot report the LGE! At this point you should involve support to minimize any data loss!

    In my example, I don't care about data loss, so I can get admintools to find an LGE like this:

    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t start_db -d vmart -F
    Info: no password specified, using none
            Starting nodes:
                    v_vmart_node0001 (192.168.61.227)
                    v_vmart_node0002 (192.168.61.228)
                    v_vmart_node0003 (192.168.61.229)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (INITIALIZING) v_vmart_node0002: (INITIALIZING) v_vmart_node0003: (INITIALIZING)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
    Found these errors in startup.logs on hosts:
    hosts ['192.168.61.228', '192.168.61.229', '192.168.61.227'] report: Startup Failed, ASR Required
    Press RETURN to continue
    Database vmart did not start successfully: Something Failed
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t return_epoch -d vmart
    LastEpoch|4093
    
    [dbadmin@SE-Sandbox-43-node1 ~]$ admintools -t restart_db -d vmart -e `admintools -t return_epoch -d vmart | sed 's/LastEpoch|//'`
    Info: no password specified, using none
    *** Restarting database vmart at '2019-03-09 08:23:32.571289-05' epoch 4093 ***
            Starting nodes:
                    v_vmart_node0001 (192.168.61.227)
                    v_vmart_node0002 (192.168.61.228)
                    v_vmart_node0003 (192.168.61.229)
            Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (DOWN) v_vmart_node0002: (DOWN) v_vmart_node0003: (DOWN)
            Node Status: v_vmart_node0001: (RECOVERING) v_vmart_node0002: (RECOVERING) v_vmart_node0003: (RECOVERING)
            Node Status: v_vmart_node0001: (UP) v_vmart_node0002: (UP) v_vmart_node0003: (UP)
    Database startup successful
    

    But, I lost my commits on the inserts into the JIM table because the committed data was in the WOS.

    [dbadmin@SE-Sandbox-43-node1 ~]$ vsql -c "SELECT * FROM jim;"
     c
    ---
    (0 rows)
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file