Node recovery during live ETLs and user operations
Hi,
Last week we were recovering nodes that had gone down due to disk issues.
We brought down other (non-buddy) nodes ourselves using admintools "Stop Vertica on Host" in order to replace the disks in those servers.
What we have observed during recovery:
With live ETLs running and users still working, incremental recovery gets stuck at various points in the historical completed phase.
To finish recovery we have to stop the recovery on those nodes and do a recovery from scratch by running
select make_ahm_now('true'); and taking downtime for ETLs and end users.
But we want to avoid downtime and scratch recovery in future scenarios, as our business gets impacted every time.
During live incremental recovery the logs kept repeating the following, which is very similar to this post:
https://smallstepstobigdata.wordpress.com/2017/10/17/vertica-automatic-recovery-not-possible/
2021-04-24 13:34:21.127 Timer Service:7f55717fb700-b000002ce84e9b [Recover] Get Local Node LGE: projection stage_temp.product_offer_b1, CPE is 0x200be24
2021-04-24 13:34:21.127 Timer Service:7f55717fb700-b000002ce84e9b [Recover] Get Local Node LGE: projection stage_temp.product_offer_b0, CPE is 0x200be24
2021-04-24 13:34:21.127 Timer Service:7f55717fb700-b000002ce84e9b [Txn] Rollback Txn: b000002ce84e9b 'ProjUtil::getLocalNodeLGE'
2021-04-24 13:34:21.129 Timer Service:7f55717fb700 [Recover] My local node LGE = 0x200be24 and current epoch = 0x200c00b
2021-04-24 13:34:21.129 Recover:7f55c5dfc700 [Recover] Node v_snapdealdwh_node0002 assessing recovery: I am not clerk
Can we also roll back the database to the LGE and try again, or is there anything else we can do for live incremental recovery, such as dropping the projections and refreshing them?
Answers
During node recovery we stop some of our processes.
We introduced a message table that those cron jobs check before starting.
If maintenance mode is enabled in that table, they don't run.
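A minimal sketch of how that check could look (the schema, table, and column names here are only placeholders for illustration, not our actual objects):

-- Flag table that the cron jobs look at before doing any work
CREATE TABLE IF NOT EXISTS ops.maintenance_flag (
    flag_name  VARCHAR(64) NOT NULL,
    is_enabled BOOLEAN DEFAULT FALSE,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- An operator enables maintenance mode before starting a node recovery
UPDATE ops.maintenance_flag
   SET is_enabled = TRUE, updated_at = CURRENT_TIMESTAMP
 WHERE flag_name = 'maintenance_mode';

-- Each cron job runs this first and exits immediately if it returns a row
SELECT 1
  FROM ops.maintenance_flag
 WHERE flag_name = 'maintenance_mode'
   AND is_enabled;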
We do a lot of table syncing between an external database and Vertica.
Earlier we just ran a delete/insert for the whole table.
Nowadays we load the data into a temporary table first and compare the destination and the temporary table with a select .. except select. If nothing changed, we don't reinsert the data.
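Roughly like this, as a sketch only (the table names are made up, and how the temporary table gets loaded from the external database depends on your own pipeline):

-- Staging table with the same structure as the target
CREATE LOCAL TEMP TABLE customer_stage ON COMMIT PRESERVE ROWS AS
SELECT * FROM dwh.customer LIMIT 0;

-- ... load the fresh copy from the external database into customer_stage ...

-- Compare staging and destination in both directions;
-- zero rows back means nothing changed and we skip the reload.
SELECT COUNT(*) AS diff_rows FROM (
    (SELECT * FROM customer_stage EXCEPT SELECT * FROM dwh.customer)
    UNION ALL
    (SELECT * FROM dwh.customer EXCEPT SELECT * FROM customer_stage)
) d;

-- Only when diff_rows > 0 do we fall back to the old full reload:
-- DELETE FROM dwh.customer;
-- INSERT INTO dwh.customer SELECT * FROM customer_stage;
-- COMMIT;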
This helps a lot if a node leaves the cluster.
Since then we seldom use the make_ahm_now approach.