One of our Vertica nodes suddenly went down

The following errors appeared in /var/log/messages:

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff81072f95>] exit_mm+0x95/0x180

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff810733df>] do_exit+0x15f/0x870

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff81073b48>] do_group_exit+0x58/0xd0

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff81088e16>] get_signal_to_deliver+0x1f6/0x460

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff8100a265>] do_signal+0x75/0x800

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff810ace0b>] ? sys_futex+0x7b/0x170

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff8100aa80>] do_notify_resume+0x90/0xc0

Jan  7 12:33:22 VerticaN2 kernel: [<ffffffff8100b341>] int_signal+0x12/0x17

Jan  7 12:33:50 VerticaN2 abrt[64596]: Saved core dump of pid 56954 (/opt/vertica/bin/vertica) to /var/spool/abrt/ccpp-2015-01-07-12:30:06-56954 (48282365952 bytes)

Jan  7 12:33:50 VerticaN2 abrtd: Directory 'ccpp-2015-01-07-12:30:06-56954' creation detected

Jan  7 12:33:57 VerticaN2 abrtd: Package 'vertica' isn't signed with proper key

Jan  7 12:33:57 VerticaN2 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-01-07-12:30:06-56954' exited with 1

Jan  7 12:33:57 VerticaN2 abrtd: Corrupted or bad directory '/var/spool/abrt/ccpp-2015-01-07-12:30:06-56954', deleting

Does anyone have any idea what could be causing this?

Thanks!

Comments

  • Hi!

    ****
    This error is a kernel crash. If you did not upgrade the kernel, and it only started happening a couple of days ago, then it is probably a hardware fault. On the off chance that it is a software issue, try to upgrade your system to see if there are any kernel upgrades.
    ****
    Is it hosted on Amazon? I'm 99% sure it is, because I'm familiar with this error.
    Don't run it on AWS; use physical machines. It's a well-known bug with old kernels (though I believe your system is up to date and the error occurs on I/O). Amazon provides poor hardware for Vertica images, especially hard drives.

  • thank you man :)
  • Hi Elad

    Did any of the Vertica nodes also crash at the time these messages were written to the messages file? This is a kernel crash issue, but I wanted to understand whether it also caused any problem on the Vertica cluster.

    Regards
    Rahul 
  • No.
    Furthermore, after it happened we started the server up again and everything was OK.

    This is the kernel version:
    2.6.32-358.el6.x86_64

    Maybe it will help you find the cause.

    Thanks.
  • Hi Elad

    Thanks for the information. Since the title says that one of the nodes suddenly went down, I asked about a node crash. Could you check vertica.log around the time the node went down to see whether it went down because of a Panic message? Did it write anything to ErrorReport.txt?

    Could you also provide a snippet of the dblog (it can be found under the catalog directory)?

    Regards
    Rahul
  • Maybe this message can help:

    VerticaN2 kernel: INFO: task vertica:56998 blocked for more than 120 seconds.

  • One more critical fact:
    All 10 servers in the Vertica DB cluster are physical HP DL380p machines!
  • Hi Elad

    The message "VerticaN2 kernel: INFO: task vertica:56998 blocked for more than 120 seconds" is itself a clear indicator that the machine had kernel issues related to disk I/O, which led to the Vertica failure because they kept blocking the Vertica process from running. Since this is not a Vertica issue, I would ask you to follow up with the customer's IT/hardware team, as they might have noticed something unusual at the same time.


    Regards
    Rahul
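
The "blocked for more than 120 seconds" line quoted above is the kernel's hung-task warning. To check whether other nodes logged the same thing, here is a minimal Python sketch that pulls hung-task warnings and abrt core-dump records out of syslog lines. The regular expressions are assumptions based only on the log lines quoted in this thread, and `scan_messages` is a hypothetical helper name, not part of Vertica or abrt.

```python
import re

# Both patterns are assumptions derived from the lines quoted in this thread;
# other kernel or abrt versions may word these messages differently.
HUNG_TASK = re.compile(
    r"kernel: INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
CORE_DUMP = re.compile(
    r"abrt\[\d+\]: Saved core dump of pid (?P<pid>\d+) "
    r"\((?P<exe>[^)]+)\) to (?P<path>\S+) \((?P<size>\d+) bytes\)"
)


def scan_messages(lines):
    """Return (hung_tasks, core_dumps) found in an iterable of syslog lines.

    hung_tasks: list of (task_name, pid, blocked_seconds)
    core_dumps: list of (pid, executable, size_in_bytes)
    """
    hung, dumps = [], []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hung.append((m.group("name"), int(m.group("pid")), int(m.group("secs"))))
            continue
        m = CORE_DUMP.search(line)
        if m:
            dumps.append((int(m.group("pid")), m.group("exe"), int(m.group("size"))))
    return hung, dumps


if __name__ == "__main__":
    # Sample lines copied from this thread.
    sample = [
        "VerticaN2 kernel: INFO: task vertica:56998 blocked for more than 120 seconds.",
        "Jan  7 12:33:50 VerticaN2 abrt[64596]: Saved core dump of pid 56954 "
        "(/opt/vertica/bin/vertica) to "
        "/var/spool/abrt/ccpp-2015-01-07-12:30:06-56954 (48282365952 bytes)",
    ]
    hung, dumps = scan_messages(sample)
    for name, pid, secs in hung:
        print(f"hung task: {name} pid={pid} blocked>{secs}s")
    for pid, exe, size in dumps:
        print(f"core dump: pid={pid} exe={exe} size={size / 2**30:.1f} GiB")
```

To run it against a real log, pass an open file instead of the sample list, for example `scan_messages(open("/var/log/messages"))`. Reading that file usually requires root.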
