Vertica node recovery stuck

aakashaakash Registered User
edited June 21 in Vertica Forum

Hello,
We have a 3-node cluster and had to restart one of the nodes to clear some zombie sessions (they would not go away with close_session).

After the restart, the node repeatedly gets stuck in the RECOVERY phase for a long time and eventually goes back to the DOWN state.

Searching through vertica.log, we see the error below for multiple tables:
2019-06-21 04:55:27.744 RecoverTable:7eff617f5700 @v_db_node0001: 01000/6772: Fail to recover table table-name due to error: Could not stop all dirty transactions[txnId = 49539595915837339; ]

Can someone please advise how we can get this node to recover and return to the UP state?
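
Since the same error appears for multiple tables, it can help to collect every blocking transaction ID from vertica.log in one pass. The following is a minimal sketch in Python; the regular expression is an assumption derived only from the single log line quoted above, and the function name `blocking_transactions` is hypothetical. Other Vertica versions may word the message differently.

```python
import re

# Pattern assumed from the one log line quoted in the post.
DIRTY_TXN_RE = re.compile(
    r"Fail to recover table (?P<table>\S+) due to error: "
    r"Could not stop all dirty transactions\[(?P<txns>[^\]]*)\]"
)
TXN_ID_RE = re.compile(r"txnId = (\d+)")


def blocking_transactions(log_lines):
    """Map each blocking transaction ID to the set of tables it blocked."""
    blocked = {}
    for line in log_lines:
        match = DIRTY_TXN_RE.search(line)
        if match is None:
            continue  # not a "dirty transactions" recovery error
        for txn_id in TXN_ID_RE.findall(match.group("txns")):
            blocked.setdefault(int(txn_id), set()).add(match.group("table"))
    return blocked


# The log line from the post, used as sample input.
sample = (
    "2019-06-21 04:55:27.744 RecoverTable:7eff617f5700 @v_db_node0001: "
    "01000/6772: Fail to recover table table-name due to error: "
    "Could not stop all dirty transactions[txnId = 49539595915837339; ]"
)
result = blocking_transactions([sample])
```

If every failing table maps back to the same transaction ID, a single stuck transaction is the likely cause of the recovery loop.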

Comments

  • skeswaniskeswani Employee, Registered User, VerticaExpert

    There is a transaction running (49539595915837339) that started before the node recovery started but has not committed. If you kill the transaction, the node should recover.
    If the transaction is hung, you will need to kill the node on which it is running, which is tricky with a 3-node database.
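
To kill the transaction, you first need the session that owns it. The sketch below builds the two statements one might run from vsql or a client driver. It assumes the `v_monitor.sessions` system table exposes `transaction_id` and `session_id` columns and that `CLOSE_SESSION` takes the session ID as a string; verify both against the documentation for your Vertica version. The helper names and the example session ID are hypothetical.

```python
def session_lookup_sql(txn_id: int) -> str:
    """Build a query that finds the session holding a given transaction."""
    # int() ensures only a numeric ID is interpolated into the statement.
    return (
        "SELECT node_name, user_name, session_id, login_timestamp, "
        "transaction_start, current_statement "
        "FROM v_monitor.sessions "
        f"WHERE transaction_id = {int(txn_id)};"
    )


def close_session_sql(session_id: str) -> str:
    """Build a CLOSE_SESSION call for the session found above."""
    escaped = session_id.replace("'", "''")  # escape embedded single quotes
    return f"SELECT CLOSE_SESSION('{escaped}');"


lookup = session_lookup_sql(49539595915837339)
# Hypothetical session ID, for illustration only.
close = close_session_sql("v_db_node0002-12345:0x1a2b")
```

The `node_name` returned by the lookup also tells you which node would have to be restarted if the session cannot be closed.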

  • aakashaakash Registered User
    edited June 21

    Does that mean all transactions need to be quiesced for the node to recover? We have some jobs that periodically push data to Vertica.

    I also found out that the transaction ID in question belongs to a zombie session that was started on 06-06-2019. Unfortunately, close_session on that session ID is not working. Is there any other way to force-kill the session? We don't want to restart the node, as only 2 nodes are up at the moment.

    P.S. We don't see any errors in vertica.log when we attempt close_session.

  • skeswaniskeswani Employee, Registered User, VerticaExpert

    No, you do not need to quiesce any transactions, but a transaction has to either roll back or commit. That is, it has to be in or out so that the recovering node knows whether to keep its data or throw it out. It cannot stay open perpetually.

    A recovering node is aware of any transaction that started after recovery started, but it needs to know the final state of a transaction that started before recovery.

    It is rather precarious with a 3-node database, since I believe you will need to restart the database if you cannot kill that hung transaction.
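
The rule described in this comment can be expressed as a small toy model. This is only an illustration of the reasoning, not Vertica's actual recovery logic; the `Txn` type, the `blockers` function, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TxnState(Enum):
    OPEN = "open"
    COMMITTED = "committed"
    ROLLED_BACK = "rolled_back"


@dataclass(frozen=True)
class Txn:
    txn_id: int
    start_time: float  # any comparable clock value
    state: TxnState


def blockers(txns, recovery_start):
    """Return IDs of transactions that would hold up recovery in this model.

    A transaction blocks only if it started before recovery began and is
    still open; later transactions are already known to the recovering node.
    """
    return [
        t.txn_id
        for t in txns
        if t.start_time < recovery_start and t.state is TxnState.OPEN
    ]


txns = [
    Txn(1, start_time=10.0, state=TxnState.OPEN),       # old and still open
    Txn(2, start_time=20.0, state=TxnState.COMMITTED),  # old but finished
    Txn(3, start_time=60.0, state=TxnState.OPEN),       # started after recovery
]
blocking = blockers(txns, recovery_start=50.0)
```

In this model only transaction 1 blocks recovery, which mirrors the situation in the thread: periodic load jobs that start and finish normally are fine, while the zombie transaction from 06-06-2019 is not.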
