Recovering Vertica cluster after an EC2 instance failure
One of my 3 nodes died (I cannot SSH into the box, I assume some kind of fatal hardware failure) and I want to replace this node with a new one. Is this failure mode recoverable by Vertica? Most of the info I found online seemed to need access to the catalog and other files from the failed node's hard drive (which in my case are 100% non-recoverable).
I created a fourth box on EC2 and I attempted to go through the steps outlined here, but noticed that I could not do this since step 2 involves adding a host (which I can't do without creating an illegal 4-node cluster on the Community Edition).
So I tried removing the down node first, which required me to:
MARK_DESIGN_KSAFE(0)
which did work. Once that was finished, I moved on to the step of removing the host from the db using admintools, but it failed trying to SSH into the (dead) node:
# sudo -u dbadmin /opt/vertica/bin/admintools -t db_remove_node -d db -s 10.x.x.253
connecting to 10.x.x.253
Could not connect to database (EOF recieved)vsql: could not connect to server: Connection refused
Is the server running on host "10.x.x.253" and accepting
TCP/IP connections on port 5433?
connecting to 10.x.x.20
Error removing node(s) from database.
['All nodes must be UP before dropping a node']
When I attempted to drop the host using the curses-based admintools UI, I got an error dialog (a couple times in a row...) suggesting my password was incorrect, even though my dbadmin user does not have a password at all! This was the approach I used:
# sudo -u dbadmin /opt/vertica/bin/admintools
Advanced > Cluster Management > Remove Host > (x) db > 10.x.x.253 > "Are you sure you want to remove ['10.x.x.53'] from the database?" (yes) > "Enter the password" (blank) > "Error: unable to connect to the database Hint: Username or password could be invalid" > (printed again) "Error: unable to connect to the database Hint: Username or password could be invalid"
I saw some suggestions online about marking the node as ephemeral and then attempting to drop the node, which I tried next. I could mark it as ephemeral but I could not drop the node without first dropping all the projections which depend on the (failed) node.
dbadmin=> select node_name, is_ephemeral from nodes;
   node_name   | is_ephemeral
---------------+--------------
 v_db_node0001 | t
 v_db_node0002 | f
 v_db_node0003 | f
dbadmin=> select REBALANCE_CLUSTER();
ERROR 2159: All nodes must be UP to rebalance a cluster
dbadmin=> drop node v_db_node0001;
NOTICE 4927: The Segment segment of projection_b0 depends on Node v_db_node0001
ROLLBACK 3128: DROP failed due to dependencies
DETAIL: Cannot drop Node v_db_node0001 because other objects depend on it
HINT: Use DROP ... CASCADE to drop the dependent objects too
I finally attempted to edit admintools.conf to point to my new node (after running install_vertica on the new node with just one host passed to -s), but it still attempted to talk to the failed EC2 host during startup:
# sudo -u dbadmin /opt/vertica/bin/admintools -t start_db -d db
Info: no password specified, using none
Error: /opt/vertica/bin/vertica -V failed; vertica not installed on 10.x.x.253
Database start up failed. vertica not installed
I have spent several hours with the documentation today and worked with a friend who is very familiar with Vertica, and we could not resolve this issue. What step am I missing in this process?
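For reference, the SQL-level steps described in this question (lowering K-safety, then marking the dead node as ephemeral) can be sketched roughly as follows. This is only a sketch of what was attempted, not a fix. It assumes a vsql session as dbadmin on one of the surviving nodes, and the `ALTER NODE ... IS EPHEMERAL` form assumes a Vertica version that supports that syntax. The two verification queries against `system` and `nodes` are additions for illustration and were not part of the original steps.

```sql
-- Lower the design K-safety so the database design is no longer
-- required to tolerate the loss of a node.
SELECT MARK_DESIGN_KSAFE(0);

-- Assumption: compare designed vs. current fault tolerance afterwards.
SELECT designed_fault_tolerance, current_fault_tolerance FROM system;

-- Mark the failed node as ephemeral (node name taken from the question).
ALTER NODE v_db_node0001 IS EPHEMERAL;

-- Assumption: node_state shows which nodes are currently UP or DOWN.
SELECT node_name, node_state, is_ephemeral FROM nodes;
```

Checking `node_state` alongside `is_ephemeral` makes it easier to see that the ephemeral flag was applied while the node itself is still reported as DOWN, which is the state in which the rebalance and drop attempts above were rejected.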