I have an HP Vertica cluster of three nodes. Two of them are working properly and the third is stuck in the INITIALIZING state. I would like to completely restart the database. How can I get the third node out of this state?
If it's ksafe=1, yes, you can take the node out. But before removing and re-adding the same node, why don't you try bringing the down node up from scratch? That is, on the down node, rename the data and catalog folders. Then go to one of the UP nodes and run select make_ahm_now('true'), and use admintools to bring the down node up. Hope this helps. Try it and let us know the result.
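A rough sketch of that sequence, assuming the catalog layout shown later in this thread (the _data directory name is my guess; adjust paths, hostname, and database name to your cluster; renaming rather than deleting keeps a fallback copy):
On the down node, as dbadmin:
mv /mnt/vertica/verticatest/v_verticatest_node0001_catalog /mnt/vertica/verticatest/v_verticatest_node0001_catalog.bak
mv /mnt/vertica/verticatest/v_verticatest_node0001_data /mnt/vertica/verticatest/v_verticatest_node0001_data.bak
On one of the UP nodes:
/opt/vertica/bin/vsql -U dbadmin -c "select make_ahm_now('true');"
/opt/vertica/bin/admintools -t restart_node -s <down_node_host_or_IP> -d verticatest -p <password>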
Comments
Which Vertica version are you using? I suppose your database is ksafe=1. There could be a data-corruption issue on the problematic node. You can kill the Vertica process on the problematic node and try to restart it forcefully with the command below:
/opt/vertica/bin/admintools -t restart_node -s <this_Hostname_or_IP> -d dbname -p password -F
If this does not solve the issue, then we need to look into the vertica.log of the problematic node.
Thanks & Regards,
Shobhit
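In case it helps with the two checks above, the version and the tail of the log can be pulled like this (a sketch; the log path assumes the catalog location shown later in this thread, since vertica.log normally sits one level above the Catalog directory):
/opt/vertica/bin/vsql -U dbadmin -c "select version();"
tail -n 100 /mnt/vertica/verticatest/v_verticatest_node0001_catalog/vertica.log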
Look at the current_fault_tolerance column; if it is 1, then your database is ksafe=1:
select current_fault_tolerance from system;
Thanks for the information. Here is my ksafe test result:
verticatest=> select current_fault_tolerance from system;
WARNING 4539: Received no response from v_verticatest_node0001 in get cluster LGE
current_fault_tolerance
-------------------------
0
(1 row)
Can I remove the node from the cluster?
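For reference, if you do end up removing it later, admintools has a db_remove_node tool. This is only a sketch; check the tool's help output on your version first, and see the replies below before resorting to it:
/opt/vertica/bin/admintools -t db_remove_node -d verticatest -s <host_of_node0001>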
Run the following and share the output:
select * from nodes;
select * from system;
verticatest=> select node_name,node_state,catalog_path from nodes;
 node_name | node_state | catalog_path
------------------------+------------+-----------------------------------------------------------------
v_verticatest_node0001 | DOWN | /mnt/vertica/verticatest/v_verticatest_node0001_catalog/Catalog
v_verticatest_node0002 | UP | /mnt/vertica/verticatest/v_verticatest_node0002_catalog/Catalog
v_verticatest_node0003 | UP | /mnt/vertica/verticatest/v_verticatest_node0003_catalog/Catalog
(3 rows)
verticatest=> select * from system;
WARNING 4539: Received no response from v_verticatest_node0001 in get cluster LGE
current_epoch            | 56782
ahm_epoch                | 56780
last_good_epoch          | 56781
refresh_epoch            | -1
designed_fault_tolerance | 1
node_count               | 3
node_down_count          | 1
current_fault_tolerance  | 0
catalog_revision_number  | 113867
wos_used_bytes           | 0
wos_row_count            | 0
ros_used_bytes           | 375718735
ros_row_count            | 38563730
total_used_bytes         | 375718735
total_row_count          | 38563730
(1 row)
Maybe I should select only some of the fields from that last query?
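If it is easier to read, the relevant columns can be selected on their own, for example:
select designed_fault_tolerance, node_count, node_down_count, current_fault_tolerance from system;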
So, your cluster is designed with designed_fault_tolerance=1, but since one node is currently down, current_fault_tolerance=0, which is expected. Before removing the node, please try the following; your node0001 may come back up and you may not need to remove it at all (a consolidated sketch of the steps follows the list).
1. Execute the command below on v_verticatest_node0001:
ps -aef | grep vertica | grep C
If it returns a process, kill it with 'kill -9 <process id>'.
2. Log in with vsql as the dbadmin user from any of the UP nodes and advance the AHM using the command below:
select make_ahm_now(true);
3. Now open admintools from any of the UP nodes and start the Vertica process on the down node, i.e. v_verticatest_node0001. This will start recovery on node0001.
Monitor the recovery process by querying the table below:
select * from recovery_status;
Let's see how it goes.
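Putting the three steps together as one rough sequence (the database and node names are taken from this thread; the password placeholder is yours to fill in):
On v_verticatest_node0001, find and kill any stuck Vertica process:
ps -aef | grep vertica | grep C
kill -9 <process id>
From any UP node, advance the AHM and restart the down node:
/opt/vertica/bin/vsql -U dbadmin -c "select make_ahm_now(true);"
/opt/vertica/bin/admintools -t restart_node -s <node0001_host_or_IP> -d verticatest -p <password>
Then watch recovery:
/opt/vertica/bin/vsql -U dbadmin -c "select * from recovery_status;"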
[root@ip-10-15-242-174 ~]# ps -aef | grep vertica
dbadmin 1048 1 0 Oct14 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dbadmin/agent.conf
dbadmin 1060 1048 0 Oct14 ? 00:18:51 /opt/vertica/oss/python/bin/python ./simply_fast.py
dbadmin 1072 1 0 Oct14 ? 00:18:23 /opt/vconsole/vendor/oracle/java/jre/1.6/bin/java -Dvertica.home=/opt/vertica -Dvconsole.home=/opt/vconsole -Djava.library.path=/opt/vconsole/lib -Dderby.system.home=/opt/vconsole/mcdb/derby -Xmx2048m -Xms1024m -XX:MaxPermSize=256m -jar /opt/vconsole/lib/webui.war
Must I kill all of these processes?
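For what it is worth, none of those three processes looks like the database itself: agent.sh and simply_fast.py belong to the Vertica agent, and the java process under /opt/vconsole is Management Console. The "| grep C" filter in step 1 was meant to isolate the actual database process, whose command line (as far as I know) carries a -C <dbname> flag, e.g.:
ps -aef | grep vertica | grep C
If that returns nothing, the Vertica process is already down on this node and there is nothing to kill; go straight to steps 2 and 3.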