Problem with restarting the database
I have an HP Vertica cluster of three nodes. Two of them are working properly and the third is stuck in the initializing state. I would like to completely restart the database. How can I get the third node out of this state?
Comments
Which Vertica version are you using?
I suppose your database is ksafe=1.
There could be a data corruption issue on the problematic node.
You can kill the vertica process on the problematic node and try to restart the node forcefully, using the command below:
/opt/vertica/bin/admintools -t restart_node -s <this_Hostname_or_IP> -d dbname -p password -F
If this does not solve the issue, then we need to look into the vertica.log of the problematic node.
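As a hedged sketch of where to look: vertica.log normally sits in the node's catalog directory (the exact path below is an assumption based on a typical layout; adjust it to your install). The filter is demonstrated on a fabricated two-line sample so you can see what it matches:

```shell
# Hedged sketch: scan the tail of vertica.log on the problematic node for
# severe messages. The path is an assumption; adjust to your catalog directory.
LOG=/mnt/vertica/verticatest/v_verticatest_node0001_catalog/vertica.log
# On the real node you would run:
#   tail -n 500 "$LOG" | grep -E 'ERROR|PANIC|FATAL'
# Demonstrated here on a fabricated two-line sample:
sample='2015-10-14 00:00:01 INFO New log
2015-10-14 00:00:02 ERROR Catalog startup failed (fabricated example line)'
echo "$sample" | grep -E 'ERROR|PANIC|FATAL'
```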
Thanks & Regards,
Shobhit
Look for the current_fault_tolerance column; if it is 1, then your database is ksafe=1:
select current_fault_tolerance from system;
WARNING 4539: Received no response from v_verticatest_node0001 in get cluster LGE
current_fault_tolerance
-------------------------
0
(1 row)
Can I remove the node from the cluster?
select * from nodes;
and
select * from system;
verticatest=> select node_name,node_state,catalog_path from nodes;
       node_name        | node_state |                          catalog_path
------------------------+------------+-----------------------------------------------------------------
 v_verticatest_node0001 | DOWN       | /mnt/vertica/verticatest/v_verticatest_node0001_catalog/Catalog
 v_verticatest_node0002 | UP         | /mnt/vertica/verticatest/v_verticatest_node0002_catalog/Catalog
 v_verticatest_node0003 | UP         | /mnt/vertica/verticatest/v_verticatest_node0003_catalog/Catalog
(3 rows)
verticatest=> select * from system;
WARNING 4539: Received no response from v_verticatest_node0001 in get cluster LGE
current_epoch            | 56782
ahm_epoch                | 56780
last_good_epoch          | 56781
refresh_epoch            | -1
designed_fault_tolerance | 1
node_count               | 3
node_down_count          | 1
current_fault_tolerance  | 0
catalog_revision_number  | 113867
wos_used_bytes           | 0
wos_row_count            | 0
ros_used_bytes           | 375718735
ros_row_count            | 38563730
total_used_bytes         | 375718735
total_row_count          | 38563730
(1 row)
Maybe I should select only some fields in that last query?
Before removing the node, please try the following; your node0001 may come up and you may not need to remove it.
1. Execute the command below on v_verticatest_node0001:
ps -aef | grep vertica | grep Catalog
If it returns a process, kill that process with 'kill -9 <process id>'.
2. Log in with vsql as the dbadmin user from any of the UP nodes and advance the AHM using the command below:
select make_ahm_now(true);
3. Now log in to admintools from any of the UP nodes and start the vertica process on the down node, i.e. v_verticatest_node0001. This will start recovery on node0001.
Monitor the recovery process by querying the table below:
select * from recovery_status;
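When monitoring, you may want only the progress columns for the recovering node; a hedged sketch (the column names below are assumptions for a recent Vertica version, so check them against your catalog first, e.g. with \d recovery_status in vsql):

```sql
-- Hedged sketch: narrow the monitoring query to a few columns.
-- Column names are assumed; verify them against your Vertica version.
select node_name, recovery_phase, is_running
from recovery_status
where node_name = 'v_verticatest_node0001';
```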
Let's see how it goes.
[root@ip-10-15-242-174 ~]# ps -aef | grep vertica
dbadmin 1048 1 0 Oct14 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dbadmin/agent.conf
dbadmin 1060 1048 0 Oct14 ? 00:18:51 /opt/vertica/oss/python/bin/python ./simply_fast.py
dbadmin 1072 1 0 Oct14 ? 00:18:23 /opt/vconsole/vendor/oracle/java/jre/1.6/bin/java -Dvertica.home=/opt/vertica -Dvconsole.home=/opt/vconsole -Djava.library.path=/opt/vconsole/lib -Dderby.system.home=/opt/vconsole/mcdb/derby -Xmx2048m -Xms1024m -XX:MaxPermSize=256m -jar /opt/vconsole/lib/webui.war
Must I kill all the processes?
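Note that the earlier advice targets only the main database process, not the agent or the console. As a hedged sketch (assuming a standard install where the main process command line contains /opt/vertica/bin/vertica, which is not visible in the output above), here is the filter applied to a fabricated sample resembling that output:

```shell
# Hedged sketch: look for only the main Vertica database process.
# Assumption: on a standard install its command line contains
# /opt/vertica/bin/vertica; the agent (agent.sh / simply_fast.py) and the
# console (webui.war) are separate services and would not match.
ps_sample='dbadmin 1048 1 0 Oct14 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dbadmin/agent.conf
dbadmin 1060 1048 0 Oct14 ? 00:18:51 /opt/vertica/oss/python/bin/python ./simply_fast.py
dbadmin 1072 1 0 Oct14 ? 00:18:23 java -jar /opt/vconsole/lib/webui.war'
# On the real node: ps -aef | grep '/opt/vertica/bin/vertica' | grep -v grep
echo "$ps_sample" | grep '/opt/vertica/bin/vertica' || echo "no main vertica process found"
```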