Vertica startup failure due to power cut (ASR required)
I am currently facing an issue with the Cisco Stealthwatch application, which uses the Vertica database. After an unclean power-down, the database is unable to start. I have attempted to restart the database from the last epoch, but it has not been successful. Checking the log file, I came across the following error message:
"stage" : "Startup Failed, ASR Required",
"text" : "Node Dependencies:\n1 - cnt: 60\n\n1 - name: v_sw_node0001\nNodes certainly in the cluster:\n\tNode 0(v_sw_node0001), epoch 5280530\nFilling more nodes to satisfy node dependencies:\nData dependencies fulfilled, remaining nodes LGEs don't matter:\n--",
I kindly request assistance in resolving this issue.
"stage" : "Startup Failed, ASR Required",
"text" : "Node Dependencies:\n1 - cnt: 60\n\n1 - name: v_sw_node0001\nNodes certainly in the cluster:\n\tNode 0(v_sw_node0001), epoch 5280530\nFilling more nodes to satisfy node dependencies:\nData dependencies fulfilled, remaining nodes LGEs don't matter:\n--",
I kindly request assistance in resolving this issue.
Answers
"ASR required" stands for Admin Specified Recovery, and we strongly recommend contacting product support for assistance to avoid data loss if the process is not able to start via start_db or restart_db without this error.
Kind regards
Did you try: admintools -t restart_db -d DBNAME -p DBPASSWORD -e last -i
If the above fails, next try: admintools -t start_db -d DBNAME -p DBPASSWORD -i -F
Let us know the outcome. Hopefully these will start your DB; a full ASR is a bit more work...
I have attempted both commands, but unfortunately neither of them was effective and the database is still down. I was wondering if you could provide any suggestions or solutions to resolve this issue.
Thank you in advance for your assistance.
Best regards
@Bryan_H ,
I also attempted to execute the following commands, but encountered the error below:
admintools -t start_db -d DBname -p DBpassword -U
vsql -U DBuser -w password
=> SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED READ WRITE;
=> ALTER DATABASE DBname SET TransactionMode = 'READ WRITE';
=> select make_ahm_now();
INTERNAL 5445: VIAssert(currAHMEpoch <= clusterLGE) failed
DETAIL: /data/jenkins/workspace/RE-ReleaseBuilds/RE-Jackhammer/server/vertica/Commands/PGCall.cpp: 19244
HINT: Please report this error to Vertica; try restating your query
As Bryan mentioned it is strongly recommended to contact support for assistance to avoid data loss.
Please note that when you start a database from a specific Epoch you might lose data
as this can also roll back any uncommitted transactions.
The following command: admintools -t start_db -d DBNAME -F
will likely fail, but the act of attempting a "force" repair of the Vertica database will put an Epoch.log
in the catalog directory. This is the same directory that contains vertica.log.
Take the hexadecimal number in Epoch.log and convert it to a decimal number.
Then run admintools -t restart_db -d DB_NAME -e <Decimal_number_of_LGE_found_in_Epoch.log>
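For the hex-to-decimal step, something like the following should work in a shell (the Epoch.log contents shown here are only an example, not values from this system):
-- Epoch.log sits next to vertica.log in the node's catalog directory
$ cat <catalog_directory>/Epoch.log
0x509432
-- Convert the hex value to decimal (prefix with 0x if the file stores bare hex digits)
$ printf '%d\n' 0x509432
5280818
$ admintools -t restart_db -d DB_NAME -e 5280818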
Thank you, @mosheg , for your response. I attempted to restart the database using admintools -t restart_db -d DB_NAME -e 5280530 as you said but encountered the following error:
Invalid value for last good epoch: '5280530'
Epoch number must be 'last' or between 5287251 and 5280530 inclusive
I believe the reason for this error is that the AHMepoch value is greater than the LGE value, which should not be the case. To confirm this, please refer to the output of the following command:
=> select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch () AHM,(get_current_epoch()- get_last_good_epoch()) CeLGDiff,(get_last_good_epoch()-get_ahm_epoch())LgeAHmDiff,get_expected_recovery_epoch();
INFO 4544: Recovery Epoch Computation:
Node Dependencies:
1 - cnt: 60
1 - name: v_sw_node0001
Nodes certainly in the cluster:
Node 0(v_sw_node0001), epoch 5280530
Filling more nodes to satisfy node dependencies:
Data dependencies fulfilled, remaining nodes LGEs don't matter:
    CE     |    LGE    |    AHM    | CeLGDiff | LgeAHmDiff | get_expected_recovery_epoch
-----------+-----------+-----------+----------+------------+-----------------------------
   5287274 |   5280530 |   5287251 |     6744 |      -6721 |                     5280530
(1 row)
Unfortunately, I am unable to seek assistance from Cisco as I do not have a support contract. I am willing to take whatever risk is necessary to recover my data, since otherwise it is already lost.
Kind regards
Please try the following:
1) Start the database in unsafe mode.
2) Identify the impacted projections.
3) Once we find the projections and decide to bring the DB up from a higher epoch, we need to run abortrecovery on those projections.
After we know the relevant checkpoint_epoch we can build the abort command.
Or disable recovery for the table/projection having the issue.
Only if abortrecovery does not help, consider dropping the table.
4) Restart from the Last Good Epoch (you might lose data, and this will also roll back any uncommitted transactions).
With the above steps Vertica will not recover the specific projection and table.
That table will have data consistency issues, and it is recommended to recreate it by loading data from source files or by copying data from the corrupted table.
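For steps 1 and 4, a minimal consolidated sketch using the same commands shown elsewhere in this thread (DBNAME, DBPASSWORD, DBuser and the epoch value are placeholders; the projection-level queries for steps 2 and 3 are environment-specific and not reproduced here):
-- 1) Start the database in unsafe mode
$ admintools -t start_db -d DBNAME -p DBPASSWORD -U
-- Check the epochs while the database is up
$ vsql -U DBuser -w DBPASSWORD
=> select get_current_epoch() ce, get_last_good_epoch() lge, get_ahm_epoch() ahm;
-- 4) Stop, then restart from the Last Good Epoch (may roll back uncommitted transactions)
$ admintools -t stop_db -d DBNAME
$ admintools -t restart_db -d DBNAME -p DBPASSWORD -e <LGE_value_from_query_above>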
Thank you, @mosheg , for your response and assistance. I followed your suggestion and dropped one of the tables, as can be seen in the attached pictures. However, despite this action, the database is still not working. Furthermore, I noticed that the AHM value is greater than LGE. Do you have any recommendations? I would prefer not to drop any more tables as they contain important data.
Before dropping the table:
After:
Kind regards
-- Try the following but please note that you might lose data
-- Start the database in unsafe mode:
admintools -t start_db -d YourDBname -p DBpassword -U
-- If DFS files are missing, remove the DFS files:
set session characteristics as transaction read write;
\i /opt/vertica/packages/logsearch/ddl/uninstall.sql
-- Verify LGE & AHM again; LGE should now be >= AHM.
select ahm_epoch,last_good_epoch from system;
-- Stop the db
$ admintools -t stop_db -d DBname
-- Try to start it normally
$ admintools -t start_db -d DBname
-- Connect to the db and recover the DFS files by running the command below
\i /opt/vertica/packages/logsearch/ddl/install.sql
@mosheg Thank you once again for your response. I followed your suggestion and executed the commands; however, I encountered errors during the process:
Please also check this:
Unfortunately the issue persists, and the AHM value is still greater than the LGE. Could you kindly offer another recommendation to help resolve this issue? Thank you for your assistance.
Kind regards
Try the following, but please note that you might lose data.
For each business-critical table that was marked with abortrecovery, use one of the following options to fix data integrity issues:
Start the database in unsafe mode:
For a non-partitioned table:
A. Create a new table and insert-select the data from the impacted fact table into the new table.
The new table may have missing data, but there is no count mismatch between the two buddy projections of the table.
B. Drop the impacted table with CASCADE and rename the new table to the original name (see the sketch below).
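A minimal sketch of options A and B, with an illustrative table name (public.impacted_fact):
-- A. Copy whatever data is readable into a new table
CREATE TABLE public.impacted_fact_new AS SELECT * FROM public.impacted_fact;
-- B. Drop the impacted table (CASCADE also drops its projections) and rename the copy
DROP TABLE public.impacted_fact CASCADE;
ALTER TABLE public.impacted_fact_new RENAME TO impacted_fact;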
For a partitioned table:
1. Identify partitions that have a count mismatch between two buddy projections by running the following query.
For partitions with a count mismatch between buddy projections, you can drop the partition and reload it, or move the impacted partition to a new table using move_partitions_to_table.
Then insert-select the data from the new table back into the original table. Impacted partitions will lose some data here, but the end result is no count mismatch between the two buddy projections of the table.
Once you have fixed all impacted partitions in a table, the above query (step 1) should return an empty result set.
Run the following query to leave a comment for future reference:
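A sketch of the partition move and the closing comment, with illustrative names and partition-key range (public.impacted_part, keys '2023-01-01' to '2023-01-31'); move_partitions_to_table should create the staging table from the source definition if it does not already exist:
-- Move the impacted partitions out of the original table into a staging table
SELECT MOVE_PARTITIONS_TO_TABLE('public.impacted_part', '2023-01-01', '2023-01-31', 'public.impacted_part_stage');
-- Reload the surviving rows back into the original table
INSERT INTO public.impacted_part SELECT * FROM public.impacted_part_stage;
COMMIT;
-- Leave a note for future reference
COMMENT ON TABLE public.impacted_part IS 'Partitions 2023-01-01..2023-01-31 rebuilt after ASR recovery';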
I was wondering about the example query to identify count mismatches, and how it can work. It joins two references to the partition table (a and b) on partition_key, but the only other predicate is on the projection_name of a. When I have many projections of different tables sharing the same partition_key (a date stamp), this incorrectly compares counts from different tables' projections.
A more correct query would seem to be:
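A sketch of such a check, assuming the default _b0/_b1 buddy-projection naming and the v_monitor.partitions system table (the column names and grouping below are assumptions, not the poster's original query):
WITH p AS (
    SELECT table_schema, projection_name, partition_key, SUM(ros_row_count) AS row_cnt
    FROM v_monitor.partitions
    GROUP BY 1, 2, 3
)
SELECT a.table_schema, a.partition_key,
       a.projection_name AS proj_b0, a.row_cnt AS b0_rows,
       b.projection_name AS proj_b1, b.row_cnt AS b1_rows
FROM p a
JOIN p b
  ON  a.table_schema    = b.table_schema
  AND a.partition_key   = b.partition_key
  AND b.projection_name = REGEXP_REPLACE(a.projection_name, '_b0$', '_b1')
WHERE a.projection_name LIKE '%_b0'
  AND a.row_cnt <> b.row_cnt;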