Cloud/virtualized cluster node down, with processes blocked in the kernel

[Deleted User] · February 2013

Node(s) in my cloud/virtual cluster went down. When I examined system logs on the servers (/var/log/messages), I saw errors like these:

Jul 24 13:21:17 ip-x-x-x-x kernel: INFO: task syslogd:1393 blocked for more than 120 seconds.
Jul 24 13:21:17 ip-x-x-x-x kernel: INFO: task xfssyncd:942 blocked for more than 120 seconds.

Is there a configuration problem with Vertica?

[Deleted User] · February 2013

Those messages indicate that something in the kernel (the kernel code itself or, more likely, a driver for a device or filesystem) has caused a system call to block for more than 120 seconds. That is far outside the acceptable response time for a system call, and the calling process may well run into further problems because the kernel served it so slowly.

We can think of this as a problem of service level guarantees. The kernel, Vertica, spread, and every other process expect disk and network IO requests to complete promptly, along with requests on related resources such as the underlying kernel locks. When that does not happen, operations start to fail.

In cloud or virtualized environments, these issues arise when virtualized IO operations take far longer to respond than they would on a bare-metal configuration. For this reason we do not recommend production deployment on cloud and virtualized environments.
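To see which processes were affected and how often, the warnings quoted in the question can be pulled out of the log with a short script. This is only a rough sketch: the regular expression is written against the two sample lines above, so other kernel versions may word the message differently, and the helper names (`find_blocked_tasks`, `summarize`, `scan_file`) are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines in the question (an assumption,
# not an official format): "... kernel: INFO: task NAME:PID blocked for more
# than N seconds."
HUNG_TASK = re.compile(
    r"kernel: INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_blocked_tasks(lines):
    """Return a (task_name, pid, seconds) tuple for each matching warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


def summarize(lines):
    """Count blocked-task warnings per task name."""
    return Counter(name for name, _pid, _secs in find_blocked_tasks(lines))


def scan_file(path="/var/log/messages"):
    """Print a per-task warning count for the given log file."""
    with open(path, errors="replace") as log:
        for name, count in summarize(log).most_common():
            print(f"{name}: {count} warning(s)")
```

Calling `scan_file()` on an affected node (reading /var/log/messages usually needs root) lists the blocked tasks by frequency. Many warnings from filesystem or logging tasks such as xfssyncd and syslogd would fit the slow-IO explanation above, though the count alone does not prove the cause.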
