recovery failing after power outage
Hi, I have a test environment that experienced a power outage on all 3 nodes (vertica-9.2.0-7) and the storage array. The system is not critical but it's weird that I cannot get the DB to recover:
[dbadmin@vdb2 ~]$ admintools -t start_db -d opsadb -U
Info: no password specified, using none
    Starting nodes:
        v_opsadb_node0001 (192.168.252.31)
        v_opsadb_node0002 (192.168.252.216)
        v_opsadb_node0003 (192.168.252.217)
    Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN)
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN)
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN)
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN)
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN)
    Node Status: v_opsadb_node0001: (DOWN) v_opsadb_node0002: (DOWN) v_opsadb_node0003: (DOWN)
    Node Status: v_opsadb_node0001: (UP) v_opsadb_node0002: (UP) v_opsadb_node0003: (UP)
Database opsadb: Startup Succeeded. All Nodes are UP

[dbadmin@vdb2 ~]$ vsql
...
dbadmin=> SELECT get_ahm_epoch();
 get_ahm_epoch
---------------
 109626381
(1 row)

dbadmin=> SELECT get_expected_recovery_epoch();
INFO 4544: Recovery Epoch Computation:
Node Dependencies:
011 - cnt: 847
101 - cnt: 847
110 - cnt: 847
111 - cnt: 158

001 - name: v_opsadb_node0001
010 - name: v_opsadb_node0002
100 - name: v_opsadb_node0003
Nodes certainly in the cluster:
        Node 2(v_opsadb_node0003), epoch 109610889
        Node 1(v_opsadb_node0002), epoch 104348833
Filling more nodes to satisfy node dependencies:
Data dependencies fulfilled, remaining nodes LGEs don't matter:
        Node 0(v_opsadb_node0001), epoch 104348128
--
 get_expected_recovery_epoch
-----------------------------
 104348833
(1 row)
So far so good, we should be able to recover to 104348833. But:
[dbadmin@vdb2 ~]$ admintools -t restart_db -d opsadb -e '104348833' -p xxx
Invalid value for last good epoch: '104348833'
Epoch number must be 'last' or between 109626381 and 104348833 inclusive
I have tried various values for the epoch, including the two values shown above (109626381 and 104348833), and I always get the same error message. Now my questions:
1. The range is backwards, AHM > LGE. Is that the reason for the (poor!) error message? Is it that my recovery epoch must be >=AHM and <=LGE, and therefore I don't have a chance of (normal) recovery here?
2. Can someone come up with a hypothesis about what must have gone wrong to end up like that? As mentioned, the power loss affected the 3 nodes and the storage, but no disks were damaged. I thought the DB could always be recovered as long as the files that made it to disk are unharmed ...
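To make question 1 concrete, here is a minimal Python sketch of what I suspect is happening. It assumes the epoch passed with -e is checked against a plain inclusive range from the AHM up to the expected recovery epoch. That rule is only my reading of the error message, not something I have confirmed in the Vertica code or docs, and the function name is made up for illustration.

```python
def valid_recovery_epochs(ahm_epoch: int, last_good_epoch: int) -> range:
    """Epochs e with ahm_epoch <= e <= last_good_epoch.

    ASSUMPTION: this inclusive range check is inferred from the wording of
    the admintools error message; it is not taken from Vertica itself.
    """
    return range(ahm_epoch, last_good_epoch + 1)


ahm = 109626381  # value returned by get_ahm_epoch() above
lge = 104348833  # value returned by get_expected_recovery_epoch() above

candidates = valid_recovery_epochs(ahm, lge)

print(len(candidates))    # 0: the range is empty because ahm > lge
print(lge in candidates)  # False
print(ahm in candidates)  # False
print(ahm - lge)          # 5277548 epochs between the two values
```

If the check really works like this, no epoch can pass it while the AHM is ahead of the recovery epoch, which would explain why every value I tried is rejected with the same message.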
I know (after reading https://softwaresupport.softwaregrp.com/doc/KM03449287 ) that I can try to salvage table data, but more than 2000 projections are affected. As it is a test env I will just revert to an older snapshot - this post is only for my curiosity.
Thank you for any insights!