Vertica is failing to start
/vertica/data/data/unified_db/v_unified_db_node0001_catalog/Epoch.log
—
Last good epoch: 0x2c25d ended at '2017-01-16 13:05:48.039817-05'
Last good catalog version: 0x32139
Vertica is failing to start:
We tried to start with the last good epoch:
/opt/vertica/bin/admintools --tool restart_db -d unified_db -p XXX -e last
Info: Using default last good epoch
*** Restarting database unified_db at '0' epoch 0 ***
Unable to read database catalogs - cannot start database.
Database startup at '0' (epoch 0) failed. Contact Vertica Technical Support
Comments
Let me explain what's going on, using a single-node Vertica DB as an example for simplicity.
A single-node Vertica DB can run into this situation if the node had some kind of issue with the ROS containers saved on disk. Vertica checks the validity of each ROS container at startup and may find a ROS container missing or corrupted. If Vertica finds a ROS container missing, it marks the checkpoint epoch (CPE) of the projection containing that container as one less than the start epoch of the missing container. The LGE of a node is the minimum of the checkpoint epochs of the projections on that node, so the situation above pulls the node's LGE back as well.
Vertica does not allow recovery to an epoch prior to the AHM (Ancient History Mark), and it does not keep an epoch-to-timestamp map for epochs prior to the AHM epoch. If node recovery suggests an LGE older than the AHM epoch, Vertica has no way to map it to a timestamp, and for that reason you are seeing a recovery epoch of 0.
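As a quick sanity check, you can compare the epochs involved from a vsql session once the database is up (a sketch; get_ahm_epoch, get_last_good_epoch, and get_current_epoch are standard Vertica meta-functions):

```sql
-- Sketch: compare the epochs involved (requires a running session).
SELECT get_ahm_epoch();        -- Ancient History Mark (AHM)
SELECT get_last_good_epoch();  -- cluster Last Good Epoch (LGE)
SELECT get_current_epoch();    -- current epoch
-- If the LGE a node wants to recover to is below the AHM,
-- startup fails with the "epoch 0" message shown above.
```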
How to fix this situation in databases on version 7.2.x and later:
admintools -t start_db -d <database_name> -U
Query the projection_checkpoint_epochs system table to find projections whose CPE is lower than the AHM epoch. You can use the get_ahm_epoch() function to find the AHM epoch.
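Such a query might look like the following (a sketch; the column names assume a recent Vertica release, so verify them against your version of the projection_checkpoint_epochs table):

```sql
-- Sketch: find projections whose checkpoint epoch (CPE) fell below the AHM.
-- Column names are assumptions; check them with \d projection_checkpoint_epochs.
SELECT node_name, projection_schema, projection_name, checkpoint_epoch
FROM projection_checkpoint_epochs
WHERE checkpoint_epoch < get_ahm_epoch()
ORDER BY checkpoint_epoch;
```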
Disable recovery for the table/projection having the issue by running the following command:
SELECT do_tm_task('abortrecovery', '<table name>');
IMPORTANT: Be aware that with the above steps you have told Vertica not to recover a specific projection and table. This table will have data consistency issues, and it is recommended to recreate it by loading data from source files or by copying data from the corrupted table.
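The recreate-and-swap step could look roughly like this (a sketch; 'mytable' is a hypothetical name, and you should validate the copied data before dropping anything):

```sql
-- Sketch: rebuild the affected table from whatever is still readable.
-- 'mytable' is a hypothetical table name; verify row counts before dropping.
CREATE TABLE mytable_rebuilt AS SELECT * FROM mytable;
-- ...validate the contents, or reload from source files with COPY instead...
DROP TABLE mytable;
ALTER TABLE mytable_rebuilt RENAME TO mytable;
```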
Hope this helps.
Thank you for this instruction. It helped us today.
Today some of our servers were shut down unexpectedly during cluster recovery, due to a power outage. The catalog became inconsistent and each node had its own LGE that was less than the AHM. The cluster didn't start either with the force option or by rolling back to the last LGE.
It would be great if such technical points were reflected in the documentation.