VERTICA NODE SLOW RECOVERY
I followed all of these steps, but recovery of one node is still very slow.
The node is in the RECOVERING state, its recovery phase is historical, and only 101/11374 (historical_completed/historical_total) has completed in 1 hour 30 minutes.
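To watch the recovery progress while it runs, here is a minimal query sketch; it assumes the historical_completed/historical_total counts quoted above come from the v_monitor.recovery_status system table:
=> SELECT node_name, recovery_phase, historical_completed, historical_total, is_running
   FROM v_monitor.recovery_status;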
Following these steps should help you bring the node up:
1) Verify whether the Vertica process is running on the target node:
a) # ps aux | grep vertica
e.g.
502 285094 139 16.1 48748200 21369020 Sl Jun05 60062:29 /opt/vertica/bin/vertica -D /catalog/scrutinload/v_scrutinload_node0001_catalog -C scrutinload -n v_scrutinload_node0001 -h 192.168.10.49 -p 5433 -P 5443 -Y ipv4 -S 2158089
b) If it is running, kill -9 that process on the target node:
kill -9 285094
c) After killing the process, validate that there is no spread.pid file under the catalog directory of the target node. If there is one, rename it (a small sketch is shown after these steps).
2) If this doesn't work, try the following:
a) Using Admintools, restart Vertica.
b) Using the Admintools command line:
$ admintools -t restart_node -F -s <node_host> -d <database_name>
3) If the above still does not work, try recovering the node from scratch:
a) Rename/remove (or move elsewhere until the node is up) and recreate the two directories (catalog and data) below on the node.
Make sure the path, owner, group, and permission mask are exactly the same as on the old directories:
/home/dbadmin/XXXdb_name/XXXXXnode_name_catalog
/home/dbadmin/XXXdb_name/XXXXXnode_name_data
b) Stop the loads/DML queries if there are any.
c) Force a moveout:
=> SELECT DO_TM_TASK('moveout');
d) Make sure the node is completely down, then advance the AHM with the true option:
=> SELECT MAKE_AHM_NOW(true);
Note: This may take a short while depending on system activity.
e) Run SELECT * FROM system; to confirm that the AHM epoch has moved forward and is close to the LGE (Last Good Epoch). A sketch of the relevant columns is shown after these steps.
f) Force-restart the down node.
Get the IP address of the down node from vsql: SELECT * FROM nodes;
g) Run the admintools command to force-restart the node:
$ admintools -t restart_node --force -s <node_host> -d <database_name>
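For step 1c, here is a minimal shell sketch; the catalog path is the same placeholder used above and must be replaced with the target node's actual catalog directory:
# Placeholder path; substitute the target node's real catalog directory.
CATALOG_DIR=/home/dbadmin/XXXdb_name/XXXXXnode_name_catalog
# If a leftover spread.pid exists, move it aside per step 1c.
ls -l $CATALOG_DIR/spread.pid && mv $CATALOG_DIR/spread.pid $CATALOG_DIR/spread.pid.old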
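For step 3e, here is a minimal sketch of the epoch check, assuming the SYSTEM table exposes the current_epoch, ahm_epoch, and last_good_epoch columns; after MAKE_AHM_NOW, the AHM epoch should have advanced up to (or close to) the last good epoch:
=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;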
Answers
Today we collected vioperf reports (vioperf quickly tests the performance of your host's input and output) from the problematic node and from a good node.
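For reference, here is a minimal sketch of how such a report can be produced; the data-directory path is a placeholder, and the exact options used for the runs below are not known:
# /opt/vertica/bin/vioperf /home/dbadmin/XXXdb_name/XXXXXnode_name_data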
Problematic node 16 has a counter value of 0 for writes:
test | directory | counter name | counter value | counter value (10 sec avg) | counter value/core | counter value/core (10 sec avg) | thread count | %CPU | %IO Wait | elapsed time (s)| remaining time (s)
Write | ...store| MB/s | 0 | 0 | 0 | 0 | 36 | 0 | 20 | 10 | 5
Write | ...store| MB/s | 0 | 0 | 0 | 0 | 36 | 0 | 19 | 15 | 0
ReWrite | ...store| (MB-read+MB-write)/s| 0+0 | 0+0 | 0+0 | 0+0 | 36 | 0 | 15 | 10 | 5
ReWrite | ...store| (MB-read+MB-write)/s| 0+0 | 0+0 | 0+0 | 0+0 | 36 | 0 | 12 | 15 | 0
Report from node 1, which is fine:
test | directory | counter name | counter value | counter value (10 sec avg) | counter value/core | counter value/core (10 sec avg) | thread count | %CPU | %IO Wait | elapsed time (s)| remaining time (s)
Write | ...store| MB/s | 920 | 920 | 25.5556 | 25.5556 | 36 | 40 | 15 | 10 | 5
Write | ...store| MB/s | 871 | 772 | 24.1944 | 21.4444 | 36 | 57 | 10 | 15 | 0
ReWrite | ...store| (MB-read+MB-write)/s| 686+686 | 686+686 | 19.0556+19.0556 | 19.0556+19.0556 | 36 | 53 | 11 | 10 | 5
ReWrite | ...store| (MB-read+MB-write)/s| 690+690 | 709+709 | 19.1667+19.1667 | 19.69
Can anyone please help us?
Regarding the outputs of the vioperf command, can you share the following items with us?