Recovery failing after power outage

bertramf (Community Edition User, Employee)

Hi, I have a test environment (vertica-9.2.0-7) in which all 3 nodes and the storage array lost power at the same time. The system is not critical, but it's weird that I cannot get the DB to recover:

[dbadmin@vdb2 ~]$ admintools -t start_db -d opsadb -U
Info: no password specified, using none
    Starting nodes: 
        v_opsadb_node0001 (192.168.252.31)
        v_opsadb_node0002 (192.168.252.216)
        v_opsadb_node0003 (192.168.252.217)
    Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN) 
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN) 
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN) 
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN) 
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN) 
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN) 
    Node Status: v_opsadb_node0001: (UP) v_opsadb_node0002: (UP) v_opsadb_node0003: (UP) 
Database opsadb: Startup Succeeded.  All Nodes are UP
[dbadmin@vdb2 ~]$ vsql 
...
dbadmin=> SELECT get_ahm_epoch();
 get_ahm_epoch 
---------------
     109626381
(1 row)
dbadmin=> SELECT get_expected_recovery_epoch();
INFO 4544:  Recovery Epoch Computation:
Node Dependencies:
011 - cnt: 847
101 - cnt: 847
110 - cnt: 847
111 - cnt: 158

001 - name: v_opsadb_node0001
010 - name: v_opsadb_node0002
100 - name: v_opsadb_node0003
Nodes certainly in the cluster:
    Node 2(v_opsadb_node0003), epoch 109610889
    Node 1(v_opsadb_node0002), epoch 104348833
Filling more nodes to satisfy node dependencies:
Data dependencies fulfilled, remaining nodes LGEs don't matter:
    Node 0(v_opsadb_node0001), epoch 104348128
--
 get_expected_recovery_epoch 
-----------------------------
                   104348833
(1 row)
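
For what it's worth, here is how I read that output, as a small Python sketch. Only the node bitmasks and epochs come from the INFO 4544 message; the greedy "highest LGE first" ordering and the function name `expected_recovery_epoch` are my assumptions, not Vertica's actual implementation.

```python
# Hypothetical reconstruction of the recovery-epoch computation.
# Node bitmasks and last good epochs (LGEs) are copied from the INFO 4544 output.
NODE_LGE = {
    0b001: 104348128,  # v_opsadb_node0001
    0b010: 104348833,  # v_opsadb_node0002
    0b100: 109610889,  # v_opsadb_node0003
}

# Assumption: each dependency mask names the nodes that hold a copy of some data,
# so a dependency is satisfied once at least one of its nodes is in the cluster.
DEPENDENCIES = [0b011, 0b101, 0b110, 0b111]


def expected_recovery_epoch(node_lge, dependencies):
    """Add nodes in descending LGE order until every dependency is covered."""
    chosen = 0
    epoch = None
    for mask, lge in sorted(node_lge.items(), key=lambda kv: kv[1], reverse=True):
        if all(dep & chosen for dep in dependencies):
            break  # data dependencies fulfilled, remaining LGEs don't matter
        chosen |= mask
        epoch = lge  # lowest LGE among the nodes chosen so far
    return epoch


print(expected_recovery_epoch(NODE_LGE, DEPENDENCIES))
```

With the values above, node0003 and node0002 are enough to cover all four dependencies, so the result is the lower of their two LGEs and node0001 is never consulted.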

So far so good: we should be able to recover to epoch 104348833. But:

[dbadmin@vdb2 ~]$ admintools -t restart_db -d opsadb -e '104348833' -p xxx
Invalid value for last good epoch: '104348833'
Epoch number must be 'last' or between 109626381 and 104348833 inclusive

I have tried various values for the epoch, including the two values mentioned above, and always get the same error message. Now my questions:
1. The range is backwards: the AHM (109626381) is greater than the LGE (104348833). Is that the reason for the (poor!) error message? Must my recovery epoch be >= AHM and <= LGE, which would leave me no chance of a (normal) recovery here?
2. Can someone offer a hypothesis about what must have gone wrong to end up in this state? As mentioned, the power loss affected all 3 nodes and the storage, but no disks were damaged. I thought the DB could always be recovered as long as the files that made it to disk are unharmed ...
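
To make question 1 concrete, here is a minimal Python sketch of the check I suspect is behind the error message. The rule "AHM <= epoch <= LGE" is my assumption based on the wording of the message, and `is_valid_epoch` is a made-up name, not an admintools function.

```python
AHM_EPOCH = 109626381        # from get_ahm_epoch()
RECOVERY_EPOCH = 104348833   # from get_expected_recovery_epoch()


def is_valid_epoch(requested, ahm_epoch, last_good_epoch):
    """Assumed rule: the requested epoch must lie in [AHM, LGE], inclusive."""
    return ahm_epoch <= requested <= last_good_epoch


# Both bounds printed in the error message fail the check themselves,
# because the lower bound (AHM) is above the upper bound (LGE).
for candidate in (RECOVERY_EPOCH, AHM_EPOCH):
    print(candidate, is_valid_epoch(candidate, AHM_EPOCH, RECOVERY_EPOCH))
```

If that assumption holds, the interval is empty whenever AHM > LGE, which would explain why every value I tried is rejected.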

I know (after reading https://softwaresupport.softwaregrp.com/doc/KM03449287) that I can try to salvage table data, but more than 2000 projections are affected. Since this is a test environment, I will just revert to an older snapshot; this post is only to satisfy my curiosity.
Thank you for any insights!
