Node recovery during live ETLs and user operations.
Last week we were recovering nodes that had gone down due to disk issues.
We took other (non-buddy) nodes down ourselves using admintools ("Stop Vertica on Host") so the disks in those servers could be replaced.
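For anyone reproducing this, the "Stop Vertica on Host" step can also be run non-interactively; a sketch of the admintools invocation we mean (the hostname is a placeholder, check the exact tool syntax against your Vertica version):

```shell
# Stop Vertica on a single host so its disks can be replaced
# (equivalent of the "Stop Vertica on Host" admintools menu option)
admintools -t stop_node -s node02.example.com
```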
What we have observed during recovery:
With ETLs running and users active, incremental recovery gets stuck at various points in the historical-completed phase.
To finish recovery we have had to stop recovery on those nodes and do a recovery from scratch by running
select make_ahm_now('true'); which means taking downtime for the ETLs and end users.
We want to avoid downtime and scratch recovery in future scenarios, as the business is impacted every time.
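For context, before resorting to make_ahm_now we compare the relevant epochs with the standard built-in functions (these are stock Vertica functions, not anything custom on our side):

```sql
-- Compare the Ancient History Mark, Last Good Epoch, and current epoch
SELECT GET_AHM_EPOCH()       AS ahm,
       GET_LAST_GOOD_EPOCH() AS lge,
       GET_CURRENT_EPOCH()   AS current_epoch;

-- Forcing the AHM forward while nodes are down is what then
-- requires the downed nodes to recover from scratch:
-- SELECT MAKE_AHM_NOW('true');
```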
During live incremental recovery, the logs kept repeating the following, which is very similar to this post:
2021-04-24 13:34:21.127 Timer Service:7f55717fb700-b000002ce84e9b [Recover] Get Local Node LGE: projection stage_temp.product_offer_b1, CPE is 0x200be24
2021-04-24 13:34:21.127 Timer Service:7f55717fb700-b000002ce84e9b [Recover] Get Local Node LGE: projection stage_temp.product_offer_b0, CPE is 0x200be24
2021-04-24 13:34:21.127 Timer Service:7f55717fb700-b000002ce84e9b [Txn] Rollback Txn: b000002ce84e9b 'ProjUtil::getLocalNodeLGE'
2021-04-24 13:34:21.129 Timer Service:7f55717fb700 [Recover] My local node LGE = 0x200be24 and current epoch = 0x200c00b
2021-04-24 13:34:21.129 Recover:7f55c5dfc700 [Recover] Node v_snapdealdwh_node0002 assessing recovery: I am not clerk
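While recovery was stuck, progress can be inspected through the standard monitoring tables (column sets vary slightly by Vertica version, so selecting everything is safest; the ILIKE filter below just narrows to the projections from the log):

```sql
-- Per-node recovery phase
SELECT * FROM v_monitor.recovery_status;

-- Per-projection recovery detail
SELECT * FROM v_monitor.projection_recoveries
 WHERE projection_name ILIKE 'product_offer%';
```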
Can we also roll the database back to the LGE and retry, or is there anything else we can do during live incremental recovery, such as
dropping projections and refreshing them?
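If the drop-and-refresh route is viable, this is a sketch of what we have in mind, using the projection names from the log above (the exact projection and anchor-table names will differ per table, and the projection would need to be recreated before refreshing):

```sql
-- Drop the stuck buddy projection on the recovering node
DROP PROJECTION stage_temp.product_offer_b1;

-- After recreating the projection (CREATE PROJECTION ...),
-- refresh just that table's projections:
SELECT REFRESH('stage_temp.product_offer');

-- or kick off a background refresh of all out-of-date projections:
SELECT START_REFRESH();
```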