Vertica Recovery Epoch
Hello,
Our Vertica database is in the middle of a recovery operation right now and I'm a little confused as to what this particular information means on the table? I have looked at the documenation on https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/MONITOR/RECOVERY_STATUS.htm but it really doesn't tell you much about what the individual columns mean. I'm particularly curious why no other nodes in my cluster have a recover_epoch, why node 0002 seems to be doing all the recovery, why current_completed has progressed further than recover_epoch, and why this recovery only sits at about 20% CPU on node 0002 while many table remain unaccessible? I am running Vertica 7.2.2. Any insight would be appreciated!
Thank you!
Below is the recovery table:
=> SELECT node_name, recover_epoch, recovery_phase, current_completed, current_total, is_running FROM recovery_status; node_name | recover_epoch | recovery_phase | current_completed | current_total | is_running ------------------------------+---------------+----------------+-------------------+---------------+------------ v_namara_production_node0001 | | | 0 | 0 | f v_namara_production_node0002 | 105797 | | 127548 | 0 | t v_namara_production_node0003 | | | 0 | 0 | f (3 rows)
Comments
Hello,
The recover_epoch is the epoch Vertica picked when node's state changed to RECOVERING. All data node missed before recover_epoch will be recovered by copying from its buddy. This step is also called "historical phase". All data after recover_epoch will be directly loaded(INSERT, COPY) or recovered by replay delete(DELETE).
If there is nothing in column "recover_epoch" for other nodes, this can mean no other nodes are going through recovery. You can also see this from "is_running". Only node0002's is_running is "true". Usually, there should be something for other nodes if those node ever went through recovery. But it is possible that the recovery_status table has been cleaned up before. Something like: select clear_data_collector() may have happened in your database. This will clean up the historical information of recovery_status.
The current_completed is the recovery pojection count instead of an epoch. So in your data base, 127548 projections have been recovered on node0002.
There is maxConcurrency for recovery pool which controls how many projections can be recovered concurrently. So, 20% CPU doesn't mean more projections can be recovered concurrently.
For "many table are unaccessible", are you using recover by table or locklite recovery? Which step are you in? "Unaccessible" means there are lock on those tables? More information will be appreciated to answer this question.
Thanks,
Jing