I have an HP Vertica cluster of three nodes. Two of them are working properly and the third is stuck in the INITIALIZING state. I would like to completely restart the database. How can I get the third node out of this state?
If it's ksafe=1, yes, you can take the node out. But before removing and re-adding the same node, why don't you try bringing the down node up from scratch? That is, on the down node, rename the data and catalog folders. Then go to one of the UP nodes and run select make_ahm_now('true'), and use admintools to bring the down node up. Hope this helps. Try it and let us know the result.
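A rough sketch of that sequence, assuming the catalog layout shown later in this thread (the _data directory name is my guess; adjust paths, hostname, and database name to your cluster; renaming rather than deleting keeps a fallback copy):
On the down node, as dbadmin:
mv /mnt/vertica/verticatest/v_verticatest_node0001_catalog /mnt/vertica/verticatest/v_verticatest_node0001_catalog.bak
mv /mnt/vertica/verticatest/v_verticatest_node0001_data /mnt/vertica/verticatest/v_verticatest_node0001_data.bak
On one of the UP nodes:
/opt/vertica/bin/vsql -U dbadmin -c "select make_ahm_now('true');"
/opt/vertica/bin/admintools -t restart_node -s <down_node_host_or_IP> -d verticatest -p <password>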
Comments
Which Vertica version are you using? I suppose your database is ksafe=1. There could be a data-corruption issue on the problematic node. You can kill the Vertica process on the problematic node and try to restart it forcefully with the command below:
/opt/vertica/bin/admintools -t restart_node -s <this_Hostname_or_IP> -d dbname -p password -F
If this does not solve the issue, then we need to look into the vertica.log of the problematic node.
Thanks & Regards,
Shobhit
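In case it helps with the two checks above, the version and the tail of the log can be pulled like this (a sketch; the log path assumes the catalog location shown later in this thread, since vertica.log normally sits one level above the Catalog directory):
/opt/vertica/bin/vsql -U dbadmin -c "select version();"
tail -n 100 /mnt/vertica/verticatest/v_verticatest_node0001_catalog/vertica.log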
Look at the current_fault_tolerance column; if it is 1, then your database is ksafe=1:
select current_fault_tolerance from system;
Thanks for the information. Here is my ksafe test result:
verticatest=> select current_fault_tolerance from system;
WARNING 4539: Received no response from v_verticatest_node0001 in get cluster LGE
current_fault_tolerance
-------------------------
0
(1 row)
Can I remove the node from the cluster?
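For reference, if you do end up removing it later, admintools has a db_remove_node tool. This is only a sketch; check the tool's help output on your version first, and see the replies below before resorting to it:
/opt/vertica/bin/admintools -t db_remove_node -d verticatest -s <host_of_node0001>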
Run the following and share the output:
select * from nodes;
select * from system;
verticatest=> select node_name,node_state,catalog_path from nodes;
 node_name | node_state | catalog_path
------------------------+------------+-----------------------------------------------------------------
v_verticatest_node0001 | DOWN | /mnt/vertica/verticatest/v_verticatest_node0001_catalog/Catalog
v_verticatest_node0002 | UP | /mnt/vertica/verticatest/v_verticatest_node0002_catalog/Catalog
v_verticatest_node0003 | UP | /mnt/vertica/verticatest/v_verticatest_node0003_catalog/Catalog
(3 rows)
verticatest=> select * from system;
WARNING 4539: Received no response from v_verticatest_node0001 in get cluster LGE
current_epoch            | 56782
ahm_epoch                | 56780
last_good_epoch          | 56781
refresh_epoch            | -1
designed_fault_tolerance | 1
node_count               | 3
node_down_count          | 1
current_fault_tolerance  | 0
catalog_revision_number  | 113867
wos_used_bytes           | 0
wos_row_count            | 0
ros_used_bytes           | 375718735
ros_row_count            | 38563730
total_used_bytes         | 375718735
total_row_count          | 38563730
(1 row)
Maybe I should select only some of the fields from that last query?
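If it is easier to read, the relevant columns can be selected on their own, for example:
select designed_fault_tolerance, node_count, node_down_count, current_fault_tolerance from system;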
So, your cluster is designed with designed_fault_tolerance=1, but since one node is currently down, current_fault_tolerance=0, which is expected. Before removing the node, please try the following; your node0001 may come back up and you may not need to remove it at all (a consolidated sketch of the steps follows the list).
1. Execute the command below on v_verticatest_node0001:
ps -aef | grep vertica | grep C
If it returns a process, kill it with 'kill -9 <process id>'.
2. Log in with vsql as the dbadmin user from any of the UP nodes and advance the AHM using the command below:
select make_ahm_now(true);
3. Now open admintools from any of the UP nodes and start the Vertica process on the down node, i.e. v_verticatest_node0001. This will start recovery on node0001.
Monitor the recovery process by querying the table below:
select * from recovery_status;
Let's see how it goes.
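Putting the three steps together as one rough sequence (the database and node names are taken from this thread; the password placeholder is yours to fill in):
On v_verticatest_node0001, find and kill any stuck Vertica process:
ps -aef | grep vertica | grep C
kill -9 <process id>
From any UP node, advance the AHM and restart the down node:
/opt/vertica/bin/vsql -U dbadmin -c "select make_ahm_now(true);"
/opt/vertica/bin/admintools -t restart_node -s <node0001_host_or_IP> -d verticatest -p <password>
Then watch recovery:
/opt/vertica/bin/vsql -U dbadmin -c "select * from recovery_status;"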
[root@ip-10-15-242-174 ~]# ps -aef | grep vertica
dbadmin 1048 1 0 Oct14 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dbadmin/agent.conf
dbadmin 1060 1048 0 Oct14 ? 00:18:51 /opt/vertica/oss/python/bin/python ./simply_fast.py
dbadmin 1072 1 0 Oct14 ? 00:18:23 /opt/vconsole/vendor/oracle/java/jre/1.6/bin/java -Dvertica.home=/opt/vertica -Dvconsole.home=/opt/vconsole -Djava.library.path=/opt/vconsole/lib -Dderby.system.home=/opt/vconsole/mcdb/derby -Xmx2048m -Xms1024m -XX:MaxPermSize=256m -jar /opt/vconsole/lib/webui.war
Must I kill all of these processes?
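For what it is worth, none of those three processes looks like the database itself: agent.sh and simply_fast.py belong to the Vertica agent, and the java process under /opt/vconsole is Management Console. The "| grep C" filter in step 1 was meant to isolate the actual database process, whose command line (as far as I know) carries a -C <dbname> flag, e.g.:
ps -aef | grep vertica | grep C
If that returns nothing, the Vertica process is already down on this node and there is nothing to kill; go straight to steps 2 and 3.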