cannot replace a permanently down node. update_vertica fails with host is unreachable

skeswani - Employee
edited October 2019 in General Discussion

Running update_vertica (or install_vertica) with --remove-host or --add-host fails because a host is unreachable, so I cannot replace a permanently down node.

Yikes! A node went down permanently, and I want to replace it with a new node (which has a different IP).
I first tried to add the new node to the cluster:
update_vertica --add-host new_host
It fails, saying that old_host is down.
However, if I try to remove the old node instead:
update_vertica --remove-host old_host
It fails, saying that old_host is part of the test2 database.

I am stuck and cannot replace the node!
Can someone help me?

Answers

  • skeswani - Employee
    edited October 2019

    You want to replace an old/dead node with a new node (which has a different IP).
    First of all, do NOT rebalance; it is not required here and is the wrong solution to this problem.

    Here is a step-by-step example of how to replace a permanently down node.

    Consider a cluster where one node is down, and you want to replace that node, 10.11.12.24 (dead/old node), with 10.11.12.30 (a new node that you just set up).

    Take the new node with the new IP 10.11.12.30 and set it up as a single-node cluster:

    [dbadmin@ip-10-11-12-30 ~]$ sudo /opt/vertica/sbin/install_vertica -s 10.11.12.30 --clean <== PROVIDE THE SAME ARGS USED FOR THE CLUSTER (on node 10.11.12.10, run: grep install_opts /opt/vertica/config/admintools.conf)
    Vertica Analytic Database 9.2.1-1 Installation Tool
    ...
    Installation complete.

    Make sure passwordless ssh is set up correctly for user dbadmin between this new node and all nodes of the existing cluster.
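A quick way to confirm the passwordless ssh requirement above is to attempt a non-interactive login to every host. This is only a minimal sketch: the function name `check_passwordless_ssh` and the `SSH_BIN` override variable are hypothetical helpers introduced here for illustration, not part of Vertica.

```shell
# Check that passwordless SSH works from this node to each listed host.
# SSH_BIN is a hypothetical override hook (defaults to ssh) so the check
# can be dry-run with a stub command.
check_passwordless_ssh() {
    ssh_bin="${SSH_BIN:-ssh}"
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "$ssh_bin" -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
                </dev/null >/dev/null 2>&1; then
            echo "$host OK"
        else
            echo "$host FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example (run as dbadmin on the new node, and again on an existing node):
#   check_passwordless_ssh 10.11.12.10 10.11.12.20 10.11.12.30
```

The dead node (10.11.12.24) is deliberately left out of the example host list, since it is expected to be unreachable.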

    Edit the admintools.conf on all nodes of the existing cluster so that it references the new node.

    In this example, node 10.11.12.24 is dead and gone:

    dbadmin=> select node_address, node_state from nodes;
    node_address | node_state
    --------------+------------
    10.11.12.10 | UP
    10.11.12.20 | UP
    10.11.12.24 | DOWN
    (3 rows)

    Modify the admintools.conf file to add the new node, as shown below.
    Original:

    [dbadmin@ip-10-11-12-10 ~]$ grep -A 1 "^\[Cluster\]" /opt/vertica/config/admintools.conf
    [Cluster]
    hosts = 10.11.12.10,10.11.12.20,10.11.12.24

    New: the hosts line now has the new node's IP address (10.11.12.30) appended, and the [Nodes] section has an entry for the new node.

    [dbadmin@ip-10-11-12-10 ~]$ grep -A 1 "^\[Cluster\]" /opt/vertica/config/admintools.conf
    [Cluster]
    hosts = 10.11.12.10,10.11.12.20,10.11.12.24,10.11.12.30 <== 10.11.12.30 IS APPENDED TO THIS LINE
    [dbadmin@ip-10-11-12-10 ~]$ grep -A 4 "^\[Nodes\]" /opt/vertica/config/admintools.conf
    [Nodes]
    v_test2_node0001 = 10.11.12.10,/vertica/data,/vertica/data
    v_test2_node0002 = 10.11.12.20,/vertica/data,/vertica/data
    v_test2_node0003 = 10.11.12.24,/vertica/data,/vertica/data
    v_test2_node0004 = 10.11.12.30,/vertica/data,/vertica/data <== THIS LINE IS ADDED; note that it says node0004
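The two manual edits above (appending the new IP to the hosts line and adding a node0004 entry under [Nodes]) could be scripted roughly as follows. This is a sketch under assumptions: `add_host_to_conf` is a hypothetical helper, it assumes the section layout shown in this example, and a real admintools.conf contains more sections than shown here. Try it on a copy of the file first and diff the result before using it.

```shell
# Append a host to the [Cluster] hosts line and add a node entry at the end
# of the [Nodes] section. A backup is written to <conf_file>.bak.
# Usage: add_host_to_conf <conf_file> <new_ip> <node_name> <catalog_path> <data_path>
add_host_to_conf() {
    conf=$1 new_ip=$2 node_name=$3 catalog=$4 data=$5
    entry="$node_name = $new_ip,$catalog,$data"
    cp "$conf" "$conf.bak"
    awk -v ip="$new_ip" -v entry="$entry" '
        # Leaving [Nodes] (next header or blank line): emit the new entry first.
        section == "[Nodes]" && (/^\[/ || /^[ \t]*$/) { print entry; section = "" }
        /^\[/ { section = $0 }
        # Only touch the hosts line inside [Cluster].
        section == "[Cluster]" && /^hosts[ \t]*=/ { $0 = $0 "," ip }
        { print }
        # [Nodes] was the last section in the file.
        END { if (section == "[Nodes]") print entry }
    ' "$conf.bak" > "$conf"
}

# Example, matching the values used in this thread (run on a copy first):
#   add_host_to_conf ./admintools.conf 10.11.12.30 v_test2_node0004 /vertica/data /vertica/data
```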

    Distribute the newly modified admintools.conf file to all nodes:

    [dbadmin@ip-10-11-12-10 ~]$ admintools -t distribute_config_files
    Initiating admintools.conf distribution...
    Could not send admintools.conf to all nodes in cluster.
    Hint: Is passwordless ssh configured correctly?
    Error message:
    Could not copy file to host 10.11.12.24 <=== THIS IS EXPECTED TO FAIL, IGNORE IT

    Check that admintools.conf was distributed correctly. Notice that the new node is included in the host list here, and that all checksums match:

    [dbadmin@ip-10-11-12-10 ~]$ for node in 10.11.12.10 10.11.12.20 10.11.12.30; do ssh $node md5sum /opt/vertica/config/admintools.conf ; done
    0b2973050e63e121744fc89004d1b3ab /opt/vertica/config/admintools.conf
    0b2973050e63e121744fc89004d1b3ab /opt/vertica/config/admintools.conf
    0b2973050e63e121744fc89004d1b3ab /opt/vertica/config/admintools.conf
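Rather than comparing the checksums by eye, the output of the loop above can be piped through a small filter. A minimal sketch, where `checksums_match` is a hypothetical helper name:

```shell
# Read md5sum-style lines ("<checksum> <path>") on stdin and succeed only if
# there is at least one line and every checksum is identical.
checksums_match() {
    awk '{ seen[$1] = 1 }
         END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Example:
#   for node in 10.11.12.10 10.11.12.20 10.11.12.30; do
#       ssh "$node" md5sum /opt/vertica/config/admintools.conf
#   done | checksums_match && echo "admintools.conf is in sync"
```

Note that if ssh to one host fails outright, that host contributes no line, so a missing host would not be detected by this filter alone.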

    Force a recovery and a node replacement:

    [dbadmin@ip-10-11-12-10 ~]$ admintools -t db_replace_node -o 10.11.12.24 -n 10.11.12.30 -d test2
    Replicating configuration to all nodes
    Starting database on replacment host
    Restarting host [10.11.12.30] with catalog [v_test2_node0003_catalog]
    Issuing multi-node restart
    Starting nodes:
    v_test2_node0003 (10.11.12.30)
    Starting Vertica on all nodes. Please wait, databases with a large catalog may take a while to initialize.
    Node Status: v_test2_node0001: (UP) v_test2_node0003: (DOWN)
    Node Status: v_test2_node0001: (UP) v_test2_node0003: (RECOVERING)
    Node Status: v_test2_node0001: (UP) v_test2_node0003: (UP)
    Checking database state
    Node Status: v_test2_node0001: (UP) v_test2_node0002: (UP) v_test2_node0003: (UP)
    Deleting catalog and data directories
    Error(s) detected while deleting catalog and data directories: Host: 10.11.12.24 Reported error in removal <== EXPECTED, IGNORE ( The database directories will need to be removed manually from 10.11.12.24)

    Voila! The new node has taken the place of the dead one:

    dbadmin=> select node_address, node_state from nodes;
    node_address | node_state
    --------------+------------
    10.11.12.10 | UP
    10.11.12.20 | UP
    10.11.12.30 | UP
    (3 rows)
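The node-state query above can also be checked from a script. The sketch below assumes vsql is run with unaligned, tuples-only output (`-At`), which prints one `name|state` pair per line; `report_down_nodes` is a hypothetical helper name.

```shell
# Read "node_name|node_state" lines on stdin, print every node that is not UP,
# and return non-zero if any such node is found.
report_down_nodes() {
    awk -F'|' 'NF && $2 != "UP" { print $1 " is " $2; bad = 1 }
               END { exit bad }'
}

# Example:
#   vsql -At -c "select node_name, node_state from nodes" | report_down_nodes \
#       && echo "all nodes UP"
```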

    Finally, clean up admintools.conf:

    [dbadmin@ip-10-11-12-10 ~]$ grep -A 1 "^\[Cluster\]" /opt/vertica/config/admintools.conf
    [Cluster]
    hosts = 10.11.12.10,10.11.12.20,10.11.12.30 <== Removed dead node 10.11.12.24
    [dbadmin@ip-10-11-12-10 ~]$ grep -A 4 "^\[Nodes\]" /opt/vertica/config/admintools.conf
    [Nodes]
    v_test2_node0001 = 10.11.12.10,/vertica/data,/vertica/data
    v_test2_node0002 = 10.11.12.20,/vertica/data,/vertica/data
    v_test2_node0004 = 10.11.12.30,/vertica/data,/vertica/data <== REMOVE THIS LINE; you added it earlier and it is now redundant
    v_test2_node0003 = 10.11.12.30,/vertica/data,/vertica/data
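The cleanup above (dropping the dead IP from the hosts line and removing the temporary node0004 entry) could be scripted along these lines. As before, this is a sketch with assumptions: `cleanup_conf` is a hypothetical helper, it relies on the section layout shown in this example, and it should be tried on a copy of the file first.

```shell
# Remove a dead host from the [Cluster] hosts line and drop a temporary node
# entry from [Nodes]. A backup is written to <conf_file>.bak.
# Usage: cleanup_conf <conf_file> <dead_ip> <temp_node_name>
cleanup_conf() {
    conf=$1 dead_ip=$2 temp_node=$3
    cp "$conf" "$conf.bak"
    awk -v dead="$dead_ip" -v temp="$temp_node" '
        /^\[/ { section = $0 }
        # Skip the temporary node entry that was added before the replacement.
        section == "[Nodes]" && $1 == temp { next }
        # Rebuild the hosts list without the dead IP (and without empty items).
        section == "[Cluster]" && /^hosts[ \t]*=/ {
            sub(/^hosts[ \t]*=[ \t]*/, "")
            n = split($0, h, ","); out = ""
            for (i = 1; i <= n; i++)
                if (h[i] != "" && h[i] != dead)
                    out = out (out == "" ? "" : ",") h[i]
            $0 = "hosts = " out
        }
        { print }
    ' "$conf.bak" > "$conf"
}

# Example, matching the values used in this thread (run on a copy first):
#   cleanup_conf ./admintools.conf 10.11.12.24 v_test2_node0004
```

Rebuilding the list item by item, instead of deleting the IP with a plain substitution, avoids leaving a stray double comma in the hosts line.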

    Distribute the cleaned-up admintools.conf and verify the checksums again:

    [dbadmin@ip-10-11-12-10 ~]$ admintools -t distribute_config_files
    Initiating admintools.conf distribution...
    Local admintools.conf sent to all nodes in the cluster.
    [dbadmin@ip-10-11-12-10 ~]$ for node in 10.11.12.10 10.11.12.20 10.11.12.30; do ssh $node md5sum /opt/vertica/config/admintools.conf ; done
    e2c3e1a650afe4958034374c096a5881 /opt/vertica/config/admintools.conf
    e2c3e1a650afe4958034374c096a5881 /opt/vertica/config/admintools.conf
    e2c3e1a650afe4958034374c096a5881 /opt/vertica/config/admintools.conf

  • Ravi - Vertica Employee

    Thanks Sumeet, this is a great solution. I would request that admintools get an option to replace an unreachable node with a new node.

  • chaima - Vertica Employee

    Thanks Sumeet for sharing!
