One of our Vertica nodes suddenly went down
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff81072f95>] exit_mm+0x95/0x180
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff810733df>] do_exit+0x15f/0x870
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff81073b48>] do_group_exit+0x58/0xd0
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff81088e16>] get_signal_to_deliver+0x1f6/0x460
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff8100a265>] do_signal+0x75/0x800
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff810ace0b>] ? sys_futex+0x7b/0x170
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff8100aa80>] do_notify_resume+0x90/0xc0
Jan 7 12:33:22 VerticaN2 kernel: [<ffffffff8100b341>] int_signal+0x12/0x17
Jan 7 12:33:50 VerticaN2 abrt[64596]: Saved core dump of pid 56954 (/opt/vertica/bin/vertica) to /var/spool/abrt/ccpp-2015-01-07-12:30:06-56954 (48282365952 bytes)
Jan 7 12:33:50 VerticaN2 abrtd: Directory 'ccpp-2015-01-07-12:30:06-56954' creation detected
Jan 7 12:33:57 VerticaN2 abrtd: Package 'vertica' isn't signed with proper key
Jan 7 12:33:57 VerticaN2 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-01-07-12:30:06-56954' exited with 1
Jan 7 12:33:57 VerticaN2 abrtd: Corrupted or bad directory '/var/spool/abrt/ccpp-2015-01-07-12:30:06-56954', deleting
Does anyone have any clue what this could be?
Thanks!
Comments
This error is a kernel crash. If you did not upgrade the kernel, and it only started happening a couple of days ago, then it is probably a hardware fault. On the off chance that it is a software issue, try upgrading your system to see if any kernel updates are available.
Is it hosted on Amazon? I'm 99% sure it is, because I'm familiar with this error.
Don't run it on AWS; use physical machines instead. It's a well-known bug with old kernels (though I believe your system is up to date and the error occurs on I/O). Amazon provides poor hardware for Vertica images, especially hard drives.
Did any Vertica node also crash at the time these messages were reported in the messages file? This is a kernel crash issue, but I wanted to understand whether it also caused any problems on the Vertica cluster.
Regards
Rahul
Furthermore, after it happened we started the server up again and everything was OK.
This is the kernel version:
2.6.32-358.el6.x86_64
Maybe it will help you find the cause.
Thanks.
Thanks for the information. Since the title says that one of the nodes suddenly went down, I asked about a node crash. Could you check vertica.log around the time the node went down to see whether it went down because of a panic message? Did it write anything to ErrorReport.txt?
Also, could you provide a snippet of the dblog (it can be found under the catalog directory)?
Regards
Rahul
VerticaN2 kernel: INFO: task vertica:56998 blocked for more than 120 seconds.
All 10 servers in the Vertica DB cluster are physical HP DL380p machines!
The message "VerticaN2 kernel: INFO: task vertica:56998 blocked for more than 120 seconds" is itself a clear indicator that the machine had kernel-level problems related to disk I/O, which kept the Vertica process blocked and led to the Vertica failure. Since this failure is not a Vertica issue, I would ask you to follow it up with the customer's IT/hardware team, as they may have noticed something unusual around the same time.
Regards
Rahul
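For anyone who wants to check how often a node logged the hung-task warning quoted in this thread, here is a minimal sketch in Python. The function name find_hung_tasks and the regular expression are illustrative assumptions based only on the single message shown above; other kernels may word the warning differently.

```python
import re

# Pattern modelled on the one warning quoted in this thread (an assumption,
# not an exhaustive description of kernel hung-task messages).
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task name, pid, seconds) for each hung-task warning in lines."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


# Sample line copied from the thread.
sample = [
    "VerticaN2 kernel: INFO: task vertica:56998 blocked for more than 120 seconds."
]
print(find_hung_tasks(sample))
```

To run it against a real log, pass an open file object instead of the sample list, for example the system messages file on the affected node (the exact path depends on the distribution's syslog setup). Many warnings clustered just before the crash would support the disk I/O explanation, but they do not prove it on their own.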