
Removing unresponsive servers from the cluster

We have had non-responsive servers freeze the cluster. In one case, the server responded to ping but not to ssh. In another case, the server responded to ping and to ssh, but we still couldn't get in through ssh. In both cases the cluster became unresponsive.

In the second case, the cluster responded again once we sent a shutdown to the server. When the server was brought back up, recovery was slow (still going after 3+ days, with under a terabyte of data).

In the first case, I can't shut the server down (no ssh), so the cluster is unresponsive. I can run vsql but cannot run commands from vsql (select now(); hangs). I can run admintools from another server in the cluster, but any command hangs. In particular, the shutdown database command has hung.

a) If a server responds to ping but cannot be shut down, how can I restore the cluster to functionality?

b) If a server is taking an inordinate amount of time to recover, is there some way I can just restart with a fresh server? admintools won't allow me to remove a server when some nodes are down.

Comments

  • 1) What is the k-safety of the database?
    Please check whether, for that particular node, the hostname is included in the list of known hosts.
    Log in to the host from the command line:
    > ssh node01
    Warning: Permanently added '...' (RSA) to the list of known hosts.
    Last login: Thu Feb 22 21:29:31 2007 from ...
    > exit
    If the workaround does not solve the problem, check your SSH configuration as described in Enable Secure Shell (SSH) Logins in the Installation Guide.
     
    2)  The issue you ran into looks like replay deletes. These are probably the worst offenders in terms of time consumption for both recovery and rebalance (adding/removing nodes). For performance reasons, when you delete or update records they are marked for deletion, and their actual removal, which entails creating new ROS containers, is deferred. These build up, and there are mechanisms in the Tuple Mover to purge them out over time. Recovery has to process those marked-for-deletion records, and this involves sorting the remaining records into new containers. The more deletes and the bigger the table, the longer it takes. There's a delete_vectors table that you can query to see how many records are marked for deletion and the ratio of data to deletes, as in the sketch below.
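    For example, a query along these lines gives a rough idea of where delete vectors are concentrated per node and projection. This is only a sketch: the deleted_row_count column name and the projection_storage comparison are from memory and may differ slightly by Vertica version.

    select node_name, schema_name, projection_name,
           sum(deleted_row_count) as deleted_rows
    from delete_vectors
    group by node_name, schema_name, projection_name
    order by deleted_rows desc;

    Comparing deleted_rows against the row counts reported in projection_storage for the same projections gives the data-to-deletes ratio mentioned above.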
     
    Please follow the steps below for faster recovery in the case of replay deletes (a command-line sketch follows the steps):
     
    - Kill the Vertica process for the node that is recovering.
     
    - Connect to one of the up nodes and execute "select make_ahm_now(true);". This will advance the AHM, which will minimize replay deletes. Because the node was last up prior to the AHM, the node will need to recover from scratch as opposed to doing an incremental recovery.
     
    - Restart Vertica on the node that needs to recover.
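     
    For reference, here is a minimal sketch of that sequence from the command line. The database name, host name, and admintools flags are placeholders and may differ by version; make_ahm_now comes from the step above, and get_current_epoch/get_ahm_epoch are standard Vertica meta-functions used here only to confirm the AHM actually advanced.

    # from one of the up nodes, check the epochs and advance the AHM
    > vsql -c "select get_current_epoch(), get_ahm_epoch();"
    > vsql -c "select make_ahm_now(true);"
    > vsql -c "select get_ahm_epoch();"
    # then restart Vertica on the recovering node, for example via admintools
    > admintools -t restart_node -d mydb -s node03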
     
  • Thank you for your reply.
    I had already tried 'select make_ahm_now(true)'.  Thanks for suggesting it.
    I tried restarting numerous times, replacing the server with another, clearing the data directory before such a replace, many other things.

    Anyway, tonight I was trading e-mails with Vertica support and
    during the exchange the database went down entirely.
    Not sure if it was caused by the diagnostics program or
    something else we were trying.
    When the database came back up, all nodes worked!

    So it is still a mystery what was going wrong, but in
    any case, all is fine now.

    Brian
