Removing unresponsive servers from the cluster
We have had unresponsive servers freeze the cluster. In one case, the server responded to ping but not to ssh. In another case, the server responded to ping and to ssh, but we couldn't actually log in through ssh. In both cases the whole cluster became unresponsive.

In the second case, the cluster responded again once we sent a shutdown to the server. When the server was brought back up, recovery was slow (still going after 3+ days, with under a terabyte of data).

In the first case, I can't shut the server down (no ssh), so the cluster is still unresponsive. I can start vsql but cannot run commands from it (select now(); hangs). I can run admintool from another server in the cluster, but any command hangs. In particular, the shutdown database command has hung.

a) If a server responds to ping but cannot be shut down, how can I restore the cluster to functionality?

b) If a server is taking an inordinate amount of time to recover, is there some way I can just restart with a fresh server? admintool won't allow me to remove a server while some nodes are down.
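For anyone trying to tell the two failure modes apart from another machine, here is a minimal sketch in Python. It is not part of Vertica or admintool; the function names `tcp_port_open` and `ssh_banner` are made up for illustration, and port 22 is assumed to be the ssh port. It only checks whether the ssh port accepts a TCP connection and whether the server sends its version banner, so it cannot say why a login fails.

```python
import socket


def tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def ssh_banner(host, port=22, timeout=3.0):
    """Return the first data the server sends after connecting, or None.

    An ssh server normally sends a version line starting with "SSH-".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(256)
    except OSError:
        return None
    text = data.decode("ascii", "replace").strip()
    return text or None
```

A host that answers ping but returns False from `tcp_port_open` would look like the first case. A host where the port opens and a banner comes back, but logins still fail, would look more like the second.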
Comments
I had already tried 'select make_ahm_now(true)'. Thanks for suggesting it.
I tried restarting numerous times, replacing the server with another, clearing the data directory before such a replacement, and many other things.
Anyway, tonight I was trading e-mails with Vertica support, and during the exchange the database went down entirely. I'm not sure whether that was caused by the diagnostics program or by something else we were trying.
When the database came back up, all nodes worked!
So it is still a mystery what was going wrong, but in any case, all is fine now.
Brian