
Problem with restarting the database

I have an HP Vertica cluster of three nodes. Two of them are working properly and the third is stuck in the INITIALIZING state. I would like to completely restart the database. How can I get the third node out of this state?

Comments

  • Options
    Hi,

    Which Vertica version are you using?
    I assume your database is K-safe=1.
    There could be a data corruption issue on the problematic node.
    You can kill the Vertica process on the problematic node and try to restart it forcefully, using the command below:
    /opt/vertica/bin/admintools -t restart_node -s <this_Hostname_or_IP> -d dbname -p password -F

    If this does not solve the issue, we will need to look into the vertica.log on the problematic node.
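
    For reference, a rough sketch of the kill-and-check part (the process path and log location assume the default /opt/vertica layout; the PID and catalog directory are placeholders, not values from this thread):

    # on the problematic node: find and kill the Vertica server process
    ps -ef | grep /opt/vertica/bin/vertica
    kill -9 <vertica_pid>

    # after retrying the forced restart above, check the end of vertica.log
    # (it normally lives in the node's catalog directory)
    tail -n 200 <catalog_dir>/vertica.log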

    Thanks & Regards,
    Shobhit
  • Options
    I have v7.0.0-1. How can I test K-safety?
  • Options
    select * from system;
    Look for the current_fault_tolerance column; if it is 1, then your database is K-safe=1.
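
    A slightly narrower query, using the column names that appear in the SYSTEM table output later in this thread, might look like:

    select designed_fault_tolerance, current_fault_tolerance, node_down_count from system;

    designed_fault_tolerance is the value the database was designed with; current_fault_tolerance drops while nodes are down.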


  • Options
    Thanks! Can I remove the node from the cluster and add it back again?
  • Options
    Nimmi_gupta (Employee)
    If it's K-safe=1, yes, you can take out the node. But before removing and adding back the same node, why not try to bring the down node up from scratch? That means: on the down node, rename the data and catalog folders. Then go to one of the UP nodes and run select make_ahm_now('true'). After that, use admintools to bring the down node up. Hope this helps. Try it and let us know the result.
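
    As a rough sketch of that sequence (the directory names below are illustrative; use the down node's actual catalog and data locations, and keep the renamed copies rather than deleting anything):

    # on the down node: move the old catalog and data directories aside
    mv /path/to/<node>_catalog /path/to/<node>_catalog.old
    mv /path/to/<node>_data    /path/to/<node>_data.old

    # on one of the UP nodes, in vsql: advance the AHM
    select make_ahm_now(true);

    # then start the down node again from admintools, which will rebuild it via recovery
    /opt/vertica/bin/admintools -t restart_node -s <down_node_host> -d <dbname> -p <password>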
  • Options
    Thanks for the information. Here is the K-safety test result:
    select current_fault_tolerance  from system;
    WARNING 4539:  Received no response from v_verticatest_node0001 in get cluster LGE
     current_fault_tolerance
    -------------------------
                           0
    (1 row)
    Can I remove the node from the cluster?

  • Options
    Can you show us the output of
    select * from nodes; and
    select * from system;

  • Options
    Yes. I have removed the IP and node_id columns:
    verticatest=> select node_name,node_state,catalog_path from nodes;
           node_name        | node_state |                          catalog_path
    ------------------------+------------+-----------------------------------------------------------------
     v_verticatest_node0001 | DOWN       | /mnt/vertica/verticatest/v_verticatest_node0001_catalog/Catalog
     v_verticatest_node0002 | UP         | /mnt/vertica/verticatest/v_verticatest_node0002_catalog/Catalog
     v_verticatest_node0003 | UP         | /mnt/vertica/verticatest/v_verticatest_node0003_catalog/Catalog
    (3 rows)
    verticatest=> select * from system;
    WARNING 4539:  Received no response from v_verticatest_node0001 in get cluster LGE
     current_epoch | ahm_epoch | last_good_epoch | refresh_epoch | designed_fault_tolerance | node_count | node_down_count | current_fault_tolerance | catalog_revision_number | wos_used_bytes | wos_row_count | ros_used_bytes | ros_row_count | total_used_bytes | total_row_count
    ---------------+-----------+-----------------+---------------+--------------------------+------------+-----------------+-------------------------+-------------------------+----------------+---------------+----------------+---------------+------------------+-----------------
             56782 |     56780 |           56781 |            -1 |                        1 |          3 |               1 |                       0 |                  113867 |              0 |             0 |      375718735 |      38563730 |        375718735 |        38563730
    (1 row)
    Maybe I should take out some fields from the last query?


  • Options
    So your cluster is designed with designed_fault_tolerance=1, but since one node is currently down, current_fault_tolerance=0, which is expected.
    Before removing the node, please try the following; node0001 may come back up and you won't need to remove it.
    1. Execute the command below on v_verticatest_node0001:
      ps -aef | grep vertica
     If it returns a vertica process, kill it with 'kill -9 <process id>'.

    2. Log in with vsql as the dbadmin user from any of the UP nodes and advance the AHM using the command below:
    select make_ahm_now(true);

    3. Now log in to admintools from any of the UP nodes and start the vertica process on the down node, i.e. v_verticatest_node0001. This will start recovery on node0001.

    Monitor the recovery process by querying the table below:
    select * from recovery_status;
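
    For example, from vsql on one of the UP nodes you can poll both the node state and the recovery progress (the exact recovery_status columns vary by version, so select * is the safe choice):

    select node_name, node_state from nodes;
    select * from recovery_status;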

    Let's see how it goes.


  • Options
    After the first command I see:
    [root@ip-10-15-242-174 ~]# ps -aef|grep vertica
    dbadmin   1048     1  0 Oct14 ?        00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dbadmin/agent.conf
    dbadmin   1060  1048  0 Oct14 ?        00:18:51 /opt/vertica/oss/python/bin/python ./simply_fast.py
    dbadmin   1072     1  0 Oct14 ?        00:18:23 /opt/vconsole/vendor/oracle/java/jre/1.6/bin/java -Dvertica.home=/opt/vertica -Dvconsole.home=/opt/vconsole -Djava.library.path=/opt/vconsole/lib -Dderby.system.home=/opt/vconsole/mcdb/derby -Xmx2048m -Xms1024m -XX:MaxPermSize=256m -jar /opt/vconsole/lib/webui.war
    Must I kill all of these processes?

  • Options
    I think there are no relevant processes to be killed. Kindly start from step 2 onwards.
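
    If you want a narrower check that matches only the database server process itself (and not the agent or console), something like this should work, assuming the default /opt/vertica install path:

    ps -ef | grep '[/]opt/vertica/bin/vertica'

    The bracketed first character keeps the grep command itself out of the results.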
  • Options
    Ok, thanks!
