Understand Why Vertica Node Went Down

Are there any guidelines to follow to understand why a node went down?

Answers

  • Options
    mosheg (Vertica Employee, Administrator)

    See: https://www.vertica.com/blog/what-should-i-do-when-the-database-node-is-down
    If needed, please open a support case.
    To look for the possible reason behind a crash, support will probably guide you on how to send a scrutinize log
    or ask for the vertica.log messages from the time the node crashed.
    You can check for any Panic message, or provide all the lines leading up to the node shutdown.
    Also provide the "/var/log/messages" output from the same node, to see whether anything unusual was reported there.
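
    The vertica.log check described above can be sketched in Python. This is only a rough illustration, not a Vertica tool: the function name, the marker strings (taken from the kind of messages mentioned in this thread) and the default context size are all assumptions.

    ```python
    from collections import deque

    # Assumed markers, based on the messages mentioned in this thread.
    CRASH_MARKERS = ("panic", "fatal signal", "handling signal")


    def find_crash_context(lines, markers=CRASH_MARKERS, before=20):
        """Return (context, hit) pairs: each marker line plus up to
        `before` lines that preceded it in the log."""
        recent = deque(maxlen=before)  # rolling window of earlier lines
        hits = []
        for line in lines:
            text = line.rstrip("\n")
            if any(marker in text.lower() for marker in markers):
                hits.append((list(recent), text))
            recent.append(text)
        return hits
    ```

    To use it, open the node's vertica.log and pass the file object as `lines`; each returned pair holds the matching line and the lines just before it, which is roughly what support would ask for.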

  • Options

    Thanks. Can you list the steps for providing the "/var/log/messages" output?
    When I queried the error_messages table, I saw "Vertica suggests allowing 1 open file per MB of memory, minimum of 65536; see 'ulimit -n'" logged before the node went down.

  • Options
    SruthiA (Vertica Employee, Administrator)

    /var/log/messages is a file. What is the output of ulimit -n?

  • Options

    SruthiA, the affected node shows 258284 for ulimit -n
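
    For reference, the rule quoted in the warning (1 open file per MB of memory, minimum of 65536) can be checked against a limit like this. It is a minimal sketch: the function names are made up for illustration, and the memory sizes in the usage note are hypothetical because the node's RAM is not stated in this thread.

    ```python
    MIN_OPEN_FILES = 65536  # minimum quoted in the warning text


    def recommended_open_files(memory_mb):
        """One open file per MB of memory, never below the quoted minimum."""
        return max(MIN_OPEN_FILES, int(memory_mb))


    def meets_suggestion(nofile_limit, memory_mb):
        """True if the open-file limit satisfies the quoted rule."""
        return nofile_limit >= recommended_open_files(memory_mb)


    def local_values():
        """Read this host's soft open-file limit and physical memory in MB.
        Unix-only: relies on the `resource` module and os.sysconf."""
        import os
        import resource

        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft == resource.RLIM_INFINITY:
            soft = float("inf")  # an unlimited value always satisfies the rule
        page_size = os.sysconf("SC_PAGE_SIZE")
        pages = os.sysconf("SC_PHYS_PAGES")
        return soft, (page_size * pages) // (1024 * 1024)
    ```

    With the 258284 reported above, `meets_suggestion(258284, 252 * 1024)` is True while `meets_suggestion(258284, 256 * 1024)` is False, so whether the warning applies depends on how much memory the node has.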

  • Options

    I have checked the /var/log/messages file, and it has no relevant information about why the node went down. The Vertica log file has these entries just before the node shut down:
    2019-08-27 11:58:17.711 EEcmdq:7f490cff9700 [Main] Handling signal: 11
    2019-08-27 11:58:18.000 DiskSpaceRefresher:7f4598e08700 [Util] Task 'DiskSpaceRefresher' enabled
    2019-08-27 11:58:18.103 EEcmdq:7f490cff9700 [Main] Received fatal signal SIGSEGV.
    2019-08-27 11:58:18.103 EEcmdq:7f490cff9700 [Main] Info: si_code: 1, si_pid: 93080, si_uid: 0, si_addr: 0x16b98
    2019-08-27 11:58:18.104 EEcmdq:7f490cff9700 @v_mydb_node0001: 01000/5439: Vertica suggests allowing 1 open file per MB of memory, minimum of 65536; see 'ulimit -n'
    No further entries appear after the node went down.
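
    The signal lines above can be decoded with a short Python sketch. The regular expressions are inferred only from the lines quoted in this post, the function name is made up, and the si_code table lists just the two common Linux values for SIGSEGV, so treat it as an illustration rather than a complete decoder.

    ```python
    import re
    import signal

    # Common Linux si_code values for SIGSEGV (not an exhaustive list).
    SEGV_CODES = {
        1: "SEGV_MAPERR (address not mapped to object)",
        2: "SEGV_ACCERR (invalid permissions for mapped object)",
    }

    SIGNAL_RE = re.compile(r"Handling signal: (\d+)")
    INFO_RE = re.compile(
        r"si_code: (-?\d+), si_pid: (\d+), si_uid: (\d+), si_addr: (0x[0-9a-fA-F]+)"
    )


    def decode_signal_lines(lines):
        """Collect the signal name, si_code and fault address from log lines."""
        result = {}
        for line in lines:
            match = SIGNAL_RE.search(line)
            if match:
                # Map the signal number to its name on this platform.
                result["signal"] = signal.Signals(int(match.group(1))).name
            match = INFO_RE.search(line)
            if match:
                code = int(match.group(1))
                result["si_code"] = code
                result["si_code_meaning"] = SEGV_CODES.get(code, "unknown")
                result["si_addr"] = int(match.group(4), 16)
        return result
    ```

    On Linux, signal 11 is SIGSEGV, and si_code 1 is the "address not mapped" case. Note that 0x16b98 is 93080 in decimal, the same number the line reports as si_pid, so that field probably mirrors the fault address rather than naming a sending process.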

    Investigating further in dc_errors, I was able to identify the transaction_id associated with the warning in the log above. The transaction belongs to the Vertica process that builds a flex table. This process has been running fine for about 2 years (it ran fine today as well). I think the error was caused by some other process using too much memory, which made the flex table build fail. Here are some other values reported in dc_errors:
    error_level: 19
    line_number: 1012
    function_name: logWarnings
    message: Vertica suggests allowing 1 open file per MB of memory, minimum of 65536; see 'ulimit -n'
    error_code: 64
    vertica_code: 5439
    error_level_name: WARNING
    cursor_position: 0
    I tried searching for the error_level and error_code/vertica_code numbers above, but could not find anything. Can someone help me understand what they mean?
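
    One way to relate these values to the log line is to split its "01000/5439" prefix. The sketch below assumes the layout "@node: SQLSTATE/vertica_code: message", which is inferred only from the single line quoted above; the function name is made up. In the SQL standard, SQLSTATE values beginning with "01" are the warning class, which matches the WARNING level reported by dc_errors.

    ```python
    import re

    # Assumed layout: "@<node>: <SQLSTATE>/<vertica_code>: <message>"
    ENTRY_RE = re.compile(
        r"@(?P<node>\w+): (?P<sqlstate>[0-9A-Z]{5})/(?P<code>\d+): (?P<message>.*)"
    )


    def parse_entry(line):
        """Split a vertica.log message into node, SQLSTATE, code and text.
        Returns None when the line does not follow the assumed layout."""
        match = ENTRY_RE.search(line)
        if match is None:
            return None
        entry = match.groupdict()
        entry["code"] = int(entry["code"])
        # SQLSTATE class "01" is the SQL-standard warning class.
        entry["is_warning_class"] = entry["sqlstate"].startswith("01")
        return entry
    ```

    Applied to the last log line above, this gives code 5439 (the same number as vertica_code in dc_errors) with a warning-class SQLSTATE, so the message reads as a warning logged alongside the crash rather than as the fatal error itself.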

  • Options
    SruthiA (Vertica Employee, Administrator)

    Anuska, if possible, could you please open a support case? We can then dig into the scrutinize output and review it further.

  • Options

    Will do. Thanks for all your help.
