Vertica startup failure due to power cut (ASR required)
I am currently facing an issue with the Cisco Stealthwatch application, which uses the Vertica database. After an unclean power-down, the database is unable to start. I have attempted to restart the database from the last epoch, but it has not been successful. Checking the log file, I came across the following error message:
"stage" : "Startup Failed, ASR Required",
"text" : "Node Dependencies:\n1 - cnt: 60\n\n1 - name: v_sw_node0001\nNodes certainly in the cluster:\n\tNode 0(v_sw_node0001), epoch 5280530\nFilling more nodes to satisfy node dependencies:\nData dependencies fulfilled, remaining nodes LGEs don't matter:\n--",
I kindly request assistance in resolving this issue.
"stage" : "Startup Failed, ASR Required",
"text" : "Node Dependencies:\n1 - cnt: 60\n\n1 - name: v_sw_node0001\nNodes certainly in the cluster:\n\tNode 0(v_sw_node0001), epoch 5280530\nFilling more nodes to satisfy node dependencies:\nData dependencies fulfilled, remaining nodes LGEs don't matter:\n--",
I kindly request assistance in resolving this issue.
Answers
"ASR required" stands for Admin Specified Recovery, and we strongly recommend contacting product support for assistance to avoid data loss if the process is not able to start via start_db or restart_db without this error.
Kind regards
Did you try: admintools -t restart_db -d DBNAME -p DBPASSWORD -e last -i
If the above fails, next try: admintools -t start_db -d DBNAME -p DBPASSWORD -i -F
Let us know the outcome. Hopefully these will start your DB; a full ASR is a bit more work...
I have attempted both commands, but unfortunately neither of them was effective and the database is still down. I was wondering if you could provide any suggestions or solutions to resolve this issue.
Thank you in advance for your assistance.
Best regards
@Bryan_H ,
I also attempted to execute the following commands, but encountered the error below:
admintools -t start_db -d DBname -p DBpassword -U
vsql -U DBuser -w password
=> SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED READ WRITE;
=> ALTER DATABASE DBname SET TransactionMode = 'READ WRITE';
=> select make_ahm_now();
INTERNAL 5445: VIAssert(currAHMEpoch <= clusterLGE) failed
DETAIL: /data/jenkins/workspace/RE-ReleaseBuilds/RE-Jackhammer/server/vertica/Commands/PGCall.cpp: 19244
HINT: Please report this error to Vertica; try restating your query
As Bryan mentioned it is strongly recommended to contact support for assistance to avoid data loss.
Please note that when you start a database from a specific Epoch you might lose data
as this can also roll back any uncommitted transactions.
The following command: admintools -t start_db -d DBNAME -F
will likely fail, but the act of attempting a "force" repair of the Vertica database will put an Epoch.log
in the catalog directory. This is the same directory that contains vertica.log.
Take the hexadecimal number in Epoch.log and convert it to a decimal number.
Then run admintools -t restart_db -d DB_NAME -e <Decimal_number_of_LGE_found_in_Epoch.log>
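For the hex-to-decimal step, something like the following should work in a shell (the Epoch.log contents shown here are only an example, not values from this system):
-- Epoch.log sits next to vertica.log in the node's catalog directory
$ cat <catalog_directory>/Epoch.log
0x509432
-- Convert the hex value to decimal (prefix with 0x if the file stores bare hex digits)
$ printf '%d\n' 0x509432
5280818
$ admintools -t restart_db -d DB_NAME -e 5280818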
Thank you, @mosheg , for your response. I attempted to restart the database using admintools -t restart_db -d DB_NAME -e 5280530 as you said but encountered the following error:
Invalid value for last good epoch: '5280530'
Epoch number must be 'last' or between 5287251 and 5280530 inclusive
I believe the reason for this error is that the AHMepoch value is greater than the LGE value, which should not be the case. To confirm this, please refer to the output of the following command:
=> select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch () AHM,(get_current_epoch()- get_last_good_epoch()) CeLGDiff,(get_last_good_epoch()-get_ahm_epoch())LgeAHmDiff,get_expected_recovery_epoch();
INFO 4544: Recovery Epoch Computation:
Node Dependencies:
1 - cnt: 60
1 - name: v_sw_node0001
Nodes certainly in the cluster:
Node 0(v_sw_node0001), epoch 5280530
Filling more nodes to satisfy node dependencies:
Data dependencies fulfilled, remaining nodes LGEs don't matter:
    CE     |    LGE    |    AHM    | CeLGDiff | LgeAHmDiff | get_expected_recovery_epoch
-----------+-----------+-----------+----------+------------+-----------------------------
   5287274 |   5280530 |   5287251 |     6744 |      -6721 |                     5280530
(1 row)
Unfortunately, I am unable to seek assistance from Cisco as I do not have a support contract. I am willing to take whatever risk is necessary to recover my data, since otherwise it is already lost.
Kind regards
Please try the following:
1) Start the database in unsafe mode.
2) Identify the impacted projections.
3) Once we find the projections and decide to bring the DB up from a higher epoch, we need to run abortrecovery on those projections.
After we know the relevant checkpoint_epoch we can build the abort command.
Or disable recovery for the table/projection having the issue.
Only if abortrecovery does not help, consider dropping the table.
4) Restart from the Last Good Epoch (you might lose data, and this will also roll back any uncommitted transactions).
With the above steps Vertica will not recover the specific projection and table.
That table will have data consistency issues, and it is recommended to recreate it by loading data from source files or by copying data from the corrupted table.
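For steps 1 and 4, a minimal consolidated sketch using the same commands shown elsewhere in this thread (DBNAME, DBPASSWORD, DBuser and the epoch value are placeholders; the projection-level queries for steps 2 and 3 are environment-specific and not reproduced here):
-- 1) Start the database in unsafe mode
$ admintools -t start_db -d DBNAME -p DBPASSWORD -U
-- Check the epochs while the database is up
$ vsql -U DBuser -w DBPASSWORD
=> select get_current_epoch() ce, get_last_good_epoch() lge, get_ahm_epoch() ahm;
-- 4) Stop, then restart from the Last Good Epoch (may roll back uncommitted transactions)
$ admintools -t stop_db -d DBNAME
$ admintools -t restart_db -d DBNAME -p DBPASSWORD -e <LGE_value_from_query_above>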
Thank you, @mosheg , for your response and assistance. I followed your suggestion and dropped one of the tables, as can be seen in the attached pictures. However, despite this action, the database is still not working. Furthermore, I noticed that the AHM value is greater than LGE. Do you have any recommendations? I would prefer not to drop any more tables as they contain important data.
Before dropping the table:
After:
Kind regards
-- Try the following but please note that you might lose data
-- Start the database in unsafe mode:
admintools -t start_db -d YourDBname -p DBpassword -U
-- If DFS files are missing, remove the DFS files:
set session characteristics as transaction read write;
\i /opt/vertica/packages/logsearch/ddl/uninstall.sql
-- Verify LGE & AHM again; LGE should now be >= AHM.
select ahm_epoch,last_good_epoch from system;
-- Stop the db
$ admintools -t stop_db -d DBname
-- Try to start it normally
$ admintools -t start_db -d DBname
-- Connect to the db and recover the DFS files by running the command below
\i /opt/vertica/packages/logsearch/ddl/install.sql
@mosheg Thank you once again for your response. I followed your suggestion and executed the commands; however, I encountered errors during the process:
Please also check this:
Unfortunately the issue persists, and the AHM value is still greater than the LGE. Could you kindly offer another recommendation to help resolve this issue? Thank you for your assistance.
Kind regards
Try the following, but please note that you might lose data.
For each business-critical table that was marked with abortrecovery, use one of the following options to fix data integrity issues:
Start the database in unsafe mode:
For a non-partitioned table:
A. Create a new table and insert-select the data from the impacted fact table into the new table.
The new table may have missing data, but there is no count mismatch between the two buddy projections of the table.
B. Drop the impacted table with CASCADE and rename the new table to the original name (see the sketch below).
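A minimal sketch of options A and B, with an illustrative table name (public.impacted_fact):
-- A. Copy whatever data is readable into a new table
CREATE TABLE public.impacted_fact_new AS SELECT * FROM public.impacted_fact;
-- B. Drop the impacted table (CASCADE also drops its projections) and rename the copy
DROP TABLE public.impacted_fact CASCADE;
ALTER TABLE public.impacted_fact_new RENAME TO impacted_fact;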
For a partitioned table:
1. Identify partitions that have a count mismatch between two buddy projections by running the following query.
For partitions with a count mismatch between buddy projections, you can drop the partition and reload it, or move the impacted partition to a new table using move_partitions_to_table.
Then insert-select the data from the new table back into the original table. Impacted partitions will lose some data here, but the end result is no count mismatch between the two buddy projections of the table.
Once you have fixed all impacted partitions in a table, the above query (step 1) should return an empty result set.
Run the following query to leave a comment for future reference:
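A sketch of the partition move and the closing comment, with illustrative names and partition-key range (public.impacted_part, keys '2023-01-01' to '2023-01-31'); move_partitions_to_table should create the staging table from the source definition if it does not already exist:
-- Move the impacted partitions out of the original table into a staging table
SELECT MOVE_PARTITIONS_TO_TABLE('public.impacted_part', '2023-01-01', '2023-01-31', 'public.impacted_part_stage');
-- Reload the surviving rows back into the original table
INSERT INTO public.impacted_part SELECT * FROM public.impacted_part_stage;
COMMIT;
-- Leave a note for future reference
COMMENT ON TABLE public.impacted_part IS 'Partitions 2023-01-01..2023-01-31 rebuilt after ASR recovery';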
I was wondering about the example query to identify count mismatches, and how it can work. It joins two references to the partition table (a and b) on partition_key, but the only other predicate is on the projection_name of a. When I have many projections of different tables sharing the same partition_key (a date stamp), this incorrectly compares counts from different tables' projections.
A more correct query would seem to be:
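A sketch of such a check, assuming the default _b0/_b1 buddy-projection naming and the v_monitor.partitions system table (the column names and grouping below are assumptions, not the poster's original query):
WITH p AS (
    SELECT table_schema, projection_name, partition_key, SUM(ros_row_count) AS row_cnt
    FROM v_monitor.partitions
    GROUP BY 1, 2, 3
)
SELECT a.table_schema, a.partition_key,
       a.projection_name AS proj_b0, a.row_cnt AS b0_rows,
       b.projection_name AS proj_b1, b.row_cnt AS b1_rows
FROM p a
JOIN p b
  ON  a.table_schema    = b.table_schema
  AND a.partition_key   = b.partition_key
  AND b.projection_name = REGEXP_REPLACE(a.projection_name, '_b0$', '_b1')
WHERE a.projection_name LIKE '%_b0'
  AND a.row_cnt <> b.row_cnt;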