Recovering Vertica cluster after an EC2 instance failure
I have a 3-node CE cluster set up, running on EC2 using the HP Vertica AMI; AKA "Vertica Analytics Platform 7.0.1-0 (ami-e7b3b38e)".
One of my 3 nodes died (I cannot SSH into the box, I assume some kind of fatal hardware failure) and I want to replace this node with a new one. Is this failure mode recoverable by Vertica? Most of the info I found online seemed to need access to the catalog and other files from the failed node's hard drive (which in my case are 100% non-recoverable).
I created a fourth box on EC2 and I attempted to go through the steps outlined here, but noticed that I could not do this since step 2 involves adding a host (which I can't do without creating an illegal 4-node cluster on the Community Edition).
So I tried removing the down node first, which required me to run MARK_DESIGN_KSAFE(0):
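That is, in vsql:
dbadmin=> SELECT MARK_DESIGN_KSAFE(0);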
This did work. Once that was finished, I moved on to the step of removing the host from the db using admintools, but it failed trying to SSH into the (dead) node:
# sudo -u dbadmin /opt/vertica/bin/admintools -t db_remove_node -d db -s 10.x.x.253
connecting to 10.x.x.253
Could not connect to database (EOF received)
vsql: could not connect to server: Connection refused
    Is the server running on host "10.x.x.253" and accepting
    TCP/IP connections on port 5433?
connecting to 10.x.x.20
Error removing node(s) from database.
['All nodes must be UP before dropping a node']
When I attempted to drop the host using the curses-based admintools UI, I got an error dialog (a couple of times in a row...) suggesting my password was incorrect, even though my dbadmin user does not have a password at all! This was the approach I used:
# sudo -u dbadmin /opt/vertica/bin/admintools
Advanced > Cluster Management > Remove Host > (x) db > 10.x.x.253 > "Are you sure you want to remove ['10.x.x.253'] from the database?" (yes) > "Enter the password" (blank) > "Error: unable to connect to the database Hint: Username or password could be invalid" > (printed again) "Error: unable to connect to the database Hint: Username or password could be invalid"
I saw some suggestions online about marking the node as ephemeral and then attempting to drop the node, which I tried next. I could mark it as ephemeral but I could not drop the node without first dropping all the projections which depend on the (failed) node.
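(For reference, I marked the node ephemeral with something like the following; the exact syntax may vary by Vertica version.)
dbadmin=> ALTER NODE v_db_node0001 IS EPHEMERAL;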
dbadmin=> select node_name, is_ephemeral from nodes;
       node_name       | is_ephemeral
-----------------------+--------------
 v_db_node0001         | t
 v_db_node0002         | f
 v_db_node0003         | f
dbadmin=> select REBALANCE_CLUSTER();
ERROR 2159: All nodes must be UP to rebalance a cluster
dbadmin=> drop node v_db_node0001;
NOTICE 4927: The Segment segment of projection_b0 depends on Node v_db_node0001
ROLLBACK 3128: DROP failed due to dependencies
DETAIL: Cannot drop Node v_db_node0001 because other objects depend on it
HINT: Use DROP ... CASCADE to drop the dependent objects too
I finally attempted to edit admintools.conf to point to my new node (after running install_vertica on the new node with just one host passed to -s), but it still attempted to talk to the failed EC2 host during startup:
# sudo -u dbadmin /opt/vertica/bin/admintools -t start_db -d db
Info: no password specified, using none
Starting nodes:
v_db_node0002 (10.x.x.20)
v_db_node0003 (10.x.x.91)
Error: /opt/vertica/bin/vertica -V failed; vertica not installed on 10.x.x.253
Database start up failed. vertica not installed
I have spent several hours with the documentation today and worked with a friend who is very familiar with Vertica, and we could not resolve this issue. What step am I missing in this process?
Comments
Did you get past this? You could try spinning up three new nodes and using the vbr.py copycluster task to copy two nodes of the database over, then let the third node recover.
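Roughly, run this as dbadmin on one of the surviving source nodes (the copycluster config file's [Mapping] section maps each source node to its destination host; check the vbr.py docs for your version for the exact keys):
$ vbr.py --task copycluster --config-file copycluster.ini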
--Sharon
Making a new cluster is the only solution I have found so far.
Configuring/adding a node can be done by running the following example command (as root) before running db_replace_node:
[root@ip-10-0-0-111 ~]# /opt/vertica/sbin/install_vertica --add-hosts 10.0.0.114 -i ~/root.pem -Y --point-to-point --dba-user-password-disabled
10.0.0.111 is an existing node from which you execute this command, and 10.0.0.114 is the new node you want to add.
The arguments are the same as those you used when you set up the cluster (if you used -u or any additional arguments, you need to use them in the above command too).
This sets up the dbadmin user with passwordless SSH. From here on, admintools can run the commands needed to set up the new node.
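After that, the replace step itself would be something like the following (10.0.0.113 stands in for the IP of the node being replaced, and the -o/-n options are from memory, so check admintools -t db_replace_node --help on your version):
[dbadmin@ip-10-0-0-111 ~]$ admintools -t db_replace_node -d test -o 10.0.0.113 -n 10.0.0.114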
** However, the install_vertica --add-hosts command above will not work in your case.
Vertica requires that the node you are trying to replace be online; after it is replaced, it can be killed.
In your case, the node is permanently down.
** There is an easier solution in your case.
Bring up a node with the exact same IP address as the lost node.
The AWS Launch Wizard has an option (Network Interfaces) where you can give the node the same private IP as the lost node.
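The same can be done from the AWS CLI; a sketch, where the instance type, subnet ID, and key name are placeholders (the AMI ID is the Vertica 7.0.1-0 one from the question):
$ aws ec2 run-instances --image-id ami-e7b3b38e --instance-type m3.large --subnet-id subnet-xxxxxxxx --key-name mykey --private-ip-address 10.x.x.253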
# Rerun the installer with the following command. You may omit --hosts, and the installer will configure all the hosts that were previously set up.
# You might have to edit /root/.ssh/known_hosts and /home/dbadmin/.ssh/known_hosts and delete the line for the host you are adding, or else ssh may complain "REMOTE HOST IDENTIFICATION HAS CHANGED".
# You might also have to create an empty catalog directory on the new node.
# Start the vertica process on the newly created node and the node will recover from scratch.
[root@ip-10-0-0-111 ~]# /opt/vertica/sbin/install_vertica -i ~/root.pem -Y --point-to-point --dba-user-password-disabled
[dbadmin@ip-10-0-0-112 ~]$ mkdir -p /vertica/data/test/v_test_node0002_catalog # create an empty catalog directory on that node
[dbadmin@ip-10-0-0-111 ~]$ admintools -t restart_node -s 10.0.0.112 -d test # and it will recover from scratch.
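You can watch the recovery from any surviving node; the new node should go from DOWN through RECOVERING to UP as it rebuilds its data:
[dbadmin@ip-10-0-0-111 ~]$ vsql -c "select node_name, node_state from nodes;"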
Note: you must not lower the K-safety.
This is not a CE-specific limitation.
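If you already ran MARK_DESIGN_KSAFE(0), as in the question, you should be able to restore it once every node is UP again (this assumes your projections are in fact still 1-safe):
dbadmin=> SELECT MARK_DESIGN_KSAFE(1);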