Can't initialize Vertica in a 3-node cluster after one machine goes down

Hi, I have a 3-node cluster. One of the machines is out of service because I need to reinstall it; however, I am assuming the other machines should keep working properly. I see the following messages in vertica.log on one of the machines that should be working:

    WARNING:pyinotify:Unable to retrieve Watch object associated to <_RawEvent cookie=0 mask=0x8000 name='' wd=1 >
    2013-08-21 10:04:48.123 Poll dispatch:0x6290450 [Comms] error SP_receive: Connection closed by spread
    2013-08-21 10:04:48.123 Poll dispatch:0x6290450 [Comms] error SP_receive: The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
    2013-08-21 22:21:00.232 Timer Service:0x7b566c0 @v_testdb_node0002: 00000/5021: Timer service done; closing session
    2013-08-21 22:21:00.733 Main:0x5fec600 @v_testdb_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:9 Event Severity: Informational [6] PostedTimestamp: 2013-08-21 22:21:00.733169 ExpirationTimestamp: 2081-09-09 00:35:07.733169 EventCodeDescription: Node State Change ProblemDescription: Changing node v_testdb_node0002 startup state to SHUTDOWN_ERROR DatabaseName: testdb Hostname: avert02
    2013-08-21 22:21:00.733 Main:0x5fec600 [Recover] Changing node v_testdb_node0002 startup state from INITIALIZING to SHUTDOWN_ERROR
    2013-08-21 22:21:00.733 Main:0x5fec600 [Txn] Begin Txn: b00000000043f5 'Recovery: Get last good epoch'
    2013-08-21 22:21:00.733 Main:0x5fec600 [Txn] Starting Commit: Txn: b00000000043f5 'Recovery: Get last good epoch'
    2013-08-21 22:21:00.734 Main:0x5fec600 [Txn] Commit Complete: Txn: b00000000043f5 at epoch 0xf
    2013-08-21 22:21:00.734 Main:0x5fec600 [Recover] Manual recovery possible: Last good epoch=0xe

Do I need to remove the broken node reference (/opt/vertica/sbin/update_vertica -R)? Doesn't Vertica simply ignore a failing machine? Tkx
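A minimal sketch of how one might check what the surviving nodes see before deciding on a recovery path, assuming the database is named testdb and admintools lives in the default /opt/vertica/bin location (view_cluster and the v_catalog queries are standard Vertica features, but exact output and options can vary by version, so treat the details as assumptions):

    # Show the state of each node as the cluster sees it (DOWN, INITIALIZING, UP, ...)
    /opt/vertica/bin/admintools -t view_cluster -d testdb

    # If the database is up on the surviving nodes, the catalog can also be queried directly
    vsql -d testdb -c "SELECT node_name, node_state FROM v_catalog.nodes;"
    vsql -d testdb -c "SELECT GET_LAST_GOOD_EPOCH();"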

Comments

  • Hi Wils, This looks to me like two simultaneous issues: one, you have a node that's not working; two, it looks like your cluster crashed (?) or otherwise didn't shut down cleanly, and it's indicating that the most recent operation was saved to disk on one of your two nodes but not both. It's the latter issue that's causing trouble: if your third node were to come back, and if it had the change in question, then the database could do a full recovery. Since Vertica doesn't know what's going on with the third node, it doesn't know whether to wait for it or to throw out the operation and resume starting one epoch back. See the "Failure Recovery" section of the Administrator's Guide (https://my.vertica.com/docs/6.1.x/PDF/HP_Vertica_6.1.x_AdminGuide.pdf) for details on how to deal with this scenario (a sketch of that procedure follows these comments). Adam
  • Hey Adam, since I am just testing, I decided to reinstall the problematic node; there was some problem with the OS. Anyway... OS reinstalled, other nodes restarted, and everything works fine =) Now I can move forward with my tests :P Thank you
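For reference, a minimal sketch of the manual recovery Adam describes (rolling the database back to the last good epoch when the missing node cannot supply the newer data), again assuming the default admintools path and a database named testdb; restart_db and its epoch option are documented admintools features, but flag spellings can differ between releases, so verify against your version before running anything:

    # Restart the database from the last good epoch, discarding anything
    # committed only after that epoch (epoch 0xe in the log above)
    /opt/vertica/bin/admintools -t restart_db -d testdb -e last

    # Only if the broken node is being removed from the cluster permanently
    # would it be dropped from the installation (the -R option from the question)
    # /opt/vertica/sbin/update_vertica -R <host_to_remove>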
