Recovering Vertica cluster after an EC2 instance failure

I have a 3-node Community Edition (CE) cluster running on EC2, built from the HP Vertica AMI "Vertica Analytics Platform 7.0.1-0 (ami-e7b3b38e)".

One of my 3 nodes died (I cannot SSH into the box, so I assume some kind of fatal hardware failure) and I want to replace it with a new one.  Is this failure mode recoverable by Vertica?  Most of the info I found online seems to require access to the catalog and other files from the failed node's disk, which in my case are completely unrecoverable.

I created a fourth box on EC2 and attempted to go through the steps outlined here, but got stuck: step 2 involves adding a host, which I can't do without creating an illegal 4-node cluster on the Community Edition.

So I tried removing the down node first, which required lowering the design K-safety:

dbadmin=> select MARK_DESIGN_KSAFE(0);

That did work.  Once it finished, I moved on to removing the host from the db using admintools, but it failed trying to reach the (dead) node:

# sudo -u dbadmin /opt/vertica/bin/admintools -t db_remove_node -d db -s 10.x.x.253
connecting to 10.x.x.253
Could not connect to database (EOF recieved)vsql: could not connect to server: Connection refused
   Is the server running on host "10.x.x.253" and accepting
   TCP/IP connections on port 5433?
connecting to 10.x.x.20
Error removing node(s) from database.
['All nodes must be UP before dropping a node']
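As a sanity check, the node states admintools is complaining about can be confirmed from a surviving node (node_state comes from the v_catalog.nodes system table; the dead node shows as DOWN, which is why db_remove_node refuses to proceed):

dbadmin=> select node_name, node_state from nodes;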
When I attempted to drop the host using the curses-based admintools UI, I got an error dialog (a couple of times in a row...) suggesting my password was incorrect, even though my dbadmin user has no password at all!  This was the sequence I followed:

# sudo -u dbadmin /opt/vertica/bin/admintools
Advanced > Cluster Management > Remove Host > (x) db > 10.x.x.253
"Are you sure you want to remove ['10.x.x.253'] from the database?" (yes)
"Enter the password" (blank)
"Error: unable to connect to the database Hint: Username or password could be invalid"
(printed again) "Error: unable to connect to the database Hint: Username or password could be invalid"


I saw some suggestions online about marking the node as ephemeral and then attempting to drop it, which I tried next.  I could mark it as ephemeral, but I could not drop the node without first dropping all the projections that depend on the (failed) node.
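For the record, the marking step was a one-liner (I believe this is the 7.x syntax for changing a node's type):

dbadmin=> alter node v_db_node0001 is ephemeral;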

dbadmin=> select node_name, is_ephemeral from nodes;
   node_name   | is_ephemeral
---------------+--------------
 v_db_node0001 | t
 v_db_node0002 | f
 v_db_node0003 | f
dbadmin=> select REBALANCE_CLUSTER();
ERROR 2159:  All nodes must be UP to rebalance a cluster
dbadmin=> drop node v_db_node0001;
NOTICE 4927:  The Segment segment of projection_b0 depends on Node v_db_node0001
ROLLBACK 3128:  DROP failed due to dependencies
DETAIL:  Cannot drop Node v_db_node0001 because other objects depend on it
HINT:  Use DROP ... CASCADE to drop the dependent objects too
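(I did not try the DROP ... CASCADE form the hint suggests, i.e.

dbadmin=> drop node v_db_node0001 cascade;

since with the design K-safety already at 0, that would drop the dependent projections, and with them the only copy of the data segments they store.)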
I finally attempted to edit admintools.conf to point at my new node (after running install_vertica on the new node with just that single host passed to -s), but startup still attempted to talk to the failed EC2 host:
# sudo -u dbadmin /opt/vertica/bin/admintools -t start_db -d db
Info: no password specified, using none
        Starting nodes: 
                v_db_node0002 (10.x.x.20)
                v_db_node0003 (10.x.x.91)
Error: /opt/vertica/bin/vertica -V failed; vertica not installed on 10.x.x.253
Database start up failed.  vertica not installed 
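For reference, the parts of admintools.conf I was editing looked roughly like this (an illustrative sketch, not my literal file; the exact layout varies by version, and the catalog/data paths here are placeholders):

[Cluster]
hosts = 10.x.x.253,10.x.x.20,10.x.x.91

[Nodes]
node0001 = 10.x.x.253,/home/dbadmin,/home/dbadmin
node0002 = 10.x.x.20,/home/dbadmin,/home/dbadmin
node0003 = 10.x.x.91,/home/dbadmin,/home/dbadmin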
I have spent several hours with the documentation today and worked with a friend who is very familiar with Vertica, and we could not resolve this issue.  What step am I missing in this process?

Comments

  • This sounds like a shortcoming of the Community Edition if the "replacing a failed node" instructions don't work.  It's a highly available cluster, as long as all three nodes remain available? :-)

    Did you get past this?  You could try spinning up three new nodes and using the vbr.py copycluster task to copy two nodes of the database over, then let the third node recover.
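    Roughly like this (a sketch only; the [Mapping] targets and paths are placeholders, and it's worth checking the vbr.py documentation for the exact copycluster config format on your version):

    # copycluster.ini
    [Misc]
    snapshotName = copy_db
    tempDir = /tmp/vbr

    [Database]
    dbName = db
    dbUser = dbadmin

    [Mapping]
    v_db_node0001 = 10.x.y.1:/home/dbadmin
    v_db_node0002 = 10.x.y.2:/home/dbadmin
    v_db_node0003 = 10.x.y.3:/home/dbadmin

    $ vbr.py --task copycluster --config-file copycluster.ini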

      --Sharon

  • As far as I could tell, there was no way to repair the cluster once one of the machines had suffered a permanent hardware failure (at least I could not find the correct steps).

    Making a new cluster is the only solution I have found so far.
  • The root cause of this issue is that the dbadmin user was not set up on the new node. AWS requires SSH key-based authentication, and without that auth (pre-)configured there is no way admintools can log in to the new node and run the commands needed to set it up.

    Configuring/adding a node can be done by running the following example command (as root) before running db_replace_node:

    [root@ip-10-0-0-111 ~]# /opt/vertica/sbin/install_vertica  --add-hosts 10.0.0.114 -i ~/root.pem -Y --point-to-point --dba-user-password-disabled
    Here, 10.0.0.111 is an existing node from which you execute the command, and 10.0.0.114 is the new node you want to add.

    The args are the same ones you used when you set up the cluster (if you used -u or any additional args, you need to use them in the above command too). This sets up the dbadmin user with passwordless SSH; from here on, admintools can run the commands needed to set up the new node.
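    With passwordless SSH in place, the replace step itself would look something like this (option letters from memory; run admintools -t db_replace_node --help to confirm them for your version):

    [dbadmin@ip-10-0-0-111 ~]$ admintools -t db_replace_node -d db -o <host-being-replaced> -n 10.0.0.114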

    ** However, the approach above will not work in your case: Vertica requires that the node you are trying to replace be online (after it is replaced, it can be killed), and in your case the node is permanently down.



    ** There is an easier solution in your case.

    Bring up a node with the exact same IP address as the lost node.
    The AWS Launch Wizard has an option (Network Interfaces) where you can give the new node the same private IP as the lost one.
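    The same can be done from the AWS CLI (a sketch; the instance type, subnet, and key name below are placeholders):

    $ aws ec2 run-instances --image-id ami-e7b3b38e --instance-type <type> --subnet-id <cluster-subnet> --private-ip-address 10.0.0.112 --key-name <your-keypair>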

    # Rerun the installer with the following command. You may omit --hosts; the installer will configure all the hosts that were previously set up.
    # You might have to edit /root/.ssh/known_hosts and /home/dbadmin/.ssh/known_hosts and delete the line for the host you are re-adding, or ssh may complain "REMOTE HOST IDENTIFICATION HAS CHANGED".
    # You might also have to create an empty catalog directory on the new node.
    # Then start the vertica process on the newly created node, and it will recover from scratch.

    [root@ip-10-0-0-111 ~]# /opt/vertica/sbin/install_vertica -i ~/root.pem -Y --point-to-point --dba-user-password-disabled
    [dbadmin@ip-10-0-0-112 ~]$ mkdir -p /vertica/data/test/v_test_node0002_catalog   # create an empty catalog directory on the new node
    [dbadmin@ip-10-0-0-111 ~]$ admintools -t restart_node -s 10.0.0.112 -d test      # the node will recover from scratch
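    (For the known_hosts cleanup mentioned above, ssh-keygen can remove the stale entries instead of hand-editing the files:)

    [root@ip-10-0-0-111 ~]# ssh-keygen -R 10.0.0.112                                     # root's known_hosts
    [root@ip-10-0-0-111 ~]# ssh-keygen -R 10.0.0.112 -f /home/dbadmin/.ssh/known_hosts   # dbadmin's known_hosts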

    Note: you must not lower the K-safety.

    This is not a CE-specific limitation.
