BUG: soft lockup - CPU for Vertica
Hello Everyone,
We recently encountered another outage on our on-prem Vertica install. I was looking at the logs generated by scrutinize, and one thing I found interesting is this set of entries in the dmesg log:
`[2002085.759439] Code: c0 00 4d 89 ee 48 89 4d b0 41 89 c5 eb 1d 90 49 83 c7 01 48 83 c3 40 4d 39 fc 0f 86 07 01 00 00 41 83 c5 01 4d 85 f6 4c 0f 44 f3 <8b> 43 18 83 f8 80 75 dc 48 8b 45 b8 0f b6 55 c0 48 8d 75 c8 4c
[2002113.703071] BUG: soft lockup - CPU#6 stuck for 22s! [vertica:31668]
[2002113.703461] Modules linked in: ppdev vmw_balloon crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr vmw_vmci i2c_piix4 shpchp parport_pc parport binfmt_misc dm_multipath ext4 mbcache jbd2 sd_mod sr_mod cdrom crc_t10dif ata_generic crct10dif_common pata_acpi mptspi drm_kms_helper scsi_transport_spi ttm mptscsih ata_piix drm libata mptbase vmxnet3 i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod
[2002113.706248] CPU: 6 PID: 31668 Comm: vertica Not tainted 3.10.0-229.14.1.el7.x86_64 #1
[2002113.706849] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[2002113.707550] task: ffff880fe4c40b60 ti: ffff880b2cdb0000 task.ti: ffff880b2cdb0000
[2002113.708141] RIP: 0010:[] [] compaction_alloc+0xf8/0x240
[2002113.708831] RSP: 0000:ffff880b2cdb3908 EFLAGS: 00000202
[2002113.709232] RAX: ffff88103ff9a6a0 RBX: 00000000005ac000 RCX: 0000000000000000
[2002113.709924] RDX: ffff88103ff9a000 RSI: ffff880b2cdb38c0 RDI: ffff88103ff9d068
[2002113.710570] RBP: ffff880b2cdb3948 R08: ffff880b2cdb3aa8 R09: ffff88103ff9d000
[2002113.711164] R10: 0000000000103a00 R11: 0000000001040000 R12: 000000001c998000
[2002113.711749] R13: 0000000001040000 R14: ffff88103ff9e008 R15: ffffffff81179ce9
[2002113.712335] FS: 00007fb0cba64700(0000) GS:ffff880fff2c0000(0000) knlGS:0000000000000000
[2002113.712947] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2002113.713309] CR2: 00007fa66e81d210 CR3: 0000000e3a9ac000 CR4: 00000000000407e0
[2002113.713899] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2002113.714485] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[2002113.715074] Stack:
[2002113.715324] 0000000000103a00 ffff88103ff9d000 0000000001040000 ffffea00040dfe40
[2002113.715933] ffff880b2cdb3a60 ffffea00040dfe00 ffffea00040dfe60 ffff880fe4c40b60
[2002113.716555] ffff880b2cdb39e8 ffffffff811b199e ffff880fe4c40b60 000000002cdb3aa8
`
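For anyone triaging a similar log, here is a minimal Python sketch of pulling the soft lockup events out of saved dmesg text. The function name `find_soft_lockups` and the dictionary keys are my own choices for illustration, and the pattern is only matched against the line format shown above, so treat it as a starting point rather than a complete parser.

```python
import re

# Matches lines in the format seen in the log above, for example:
# [2002113.703071] BUG: soft lockup - CPU#6 stuck for 22s! [vertica:31668]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(dmesg_text):
    """Return one dict per soft lockup line found in the dmesg text."""
    events = []
    for m in SOFT_LOCKUP.finditer(dmesg_text):
        events.append({
            "uptime_s": float(m.group("ts")),   # seconds since boot
            "cpu": int(m.group("cpu")),         # CPU that was stuck
            "stuck_s": int(m.group("secs")),    # reported stuck duration
            "comm": m.group("comm"),            # process name
            "pid": int(m.group("pid")),
        })
    return events


# Sample line copied from the dmesg output in this post.
sample = (
    "[2002113.703071] BUG: soft lockup - CPU#6 stuck for 22s! "
    "[vertica:31668]\n"
)

for event in find_soft_lockups(sample):
    print(event)
```

Running this over the full dmesg file shows how often the lockups occur and whether they always involve the same process, which makes it easier to line them up against the outage window.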
CPU Details
` Static hostname: localhost.localdomain
Icon name: computer-vm
Chassis: vm
Machine ID: 247cd847e16e4d5fa0b4fd08abe51193
Boot ID: 89f515ffa1e5466096df4cbc241497ad
Operating System: Red Hat Enterprise Linux Server 7.1 (Maipo)
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.1:GA:server
Kernel: Linux 3.10.0-229.14.1.el7.x86_64
Architecture: x86_64
I'm not a Linux guy and am trying to do the RCA. I'm not sure whether I can correlate the disk usage below with the 'CPU contention' issue above and the eventual Vertica outage. Are there any other areas worth looking at at this point?
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.25 0.01 0.49 0.47 4.29 19.08 0.00 2.69 3.01 2.69 0.58 0.03
sdb 0.00 7.20 0.01 6.64 0.58 169.66 51.17 0.02 3.73 10.28 3.72 0.29 0.19
dm-0 0.00 0.00 0.00 0.02 0.13 0.08 17.00 0.00 6.57 2.48 7.45 0.39 0.00
dm-1 0.00 0.00 0.00 0.37 0.22 2.45 14.38 0.00 3.25 3.48 3.25 0.41 0.02
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.66 0.66 0.00 0.52 0.00
dm-3 0.00 0.00 0.01 13.84 0.58 169.66 24.58 0.05 3.49 10.13 3.48 0.14 0.19
dm-4 0.00 0.00 0.00 0.00 0.01 0.09 77.10 0.00 8.06 3.48 8.51 0.49 0.00
dm-5 0.00 0.00 0.00 0.25 0.00 1.24 9.80 0.00 0.91 1.13 0.91 0.51 0.01
dm-6 0.00 0.00 0.00 0.03 0.01 0.04 3.04 0.00 1.37 8.46 1.35 0.49 0.00
dm-7 0.00 0.00 0.00 0.00 0.00 0.00 7.99 0.00 1.69 2.04 0.80 0.57 0.00
dm-8 0.00 0.00 0.00 0.06 0.08 0.38 15.46 0.00 3.98 4.27 3.98 0.33 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 2.33 0.00 2.00 0.00 17.33 17.33 0.00 0.33 0.00 0.33 0.33 0.07
sdb 0.00 10.67 0.00 1.00 0.00 48.00 96.00 0.00 1.67 0.00 1.67 1.67 0.17
dm-0 0.00 0.00 0.00 2.33 0.00 9.33 8.00 0.00 0.86 0.00 0.86 0.14 0.03
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 11.67 0.00 48.00 8.23 0.03 2.89 0.00 2.89 0.14 0.17
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 1.67 0.00 6.67 8.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-8 0.00 0.00 0.00 0.33 0.00 1.33 8.00 0.00 1.00 0.00 1.00 1.00 0.03
` iostat
Linux 3.10.0-229.14.1.el7.x86_64 (uspnsvulx162.test.ua3.eslabs.svcs.hpe.com) 06/28/2017 x86_64 (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.41 0.44 0.42 0.02 0.00 98.71
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.50 0.47 4.29 4560097 41696233
sdb 6.66 0.58 172.75 5664977 1679077828
dm-0 0.02 0.13 0.08 1229677 780460
dm-1 0.37 0.22 2.45 2132189 23845512
dm-2 0.00 0.00 0.00 888 0
dm-3 13.86 0.58 172.75 5664089 1679077828
dm-4 0.00 0.01 0.09 139505 921076
dm-5 0.25 0.00 1.24 29645 12086960
dm-6 0.03 0.01 0.04 126798 356285
dm-7 0.00 0.00 0.00 2253 884
dm-8 0.06 0.08 0.38 793169 3701944
`
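To make the disk figures easier to compare, here is a rough Python sketch that parses the extended iostat rows and picks out the busiest device. The helper names `parse_iostat_x` and `busiest` are hypothetical, and the parser assumes the column layout shown above (a `Device:` header line followed by one row per device); other sysstat versions print different columns.

```python
def parse_iostat_x(text):
    """Parse extended iostat device rows into dicts keyed by header name."""
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            # Remember the column names from the most recent header line.
            header = fields[1:]
            continue
        if header and len(fields) == len(header) + 1:
            try:
                values = [float(v) for v in fields[1:]]
            except ValueError:
                continue  # not a numeric device row, skip it
            row = dict(zip(header, values))
            row["device"] = fields[0]
            rows.append(row)
    return rows


def busiest(rows, column="%util"):
    """Return the row with the highest value in the given column."""
    return max(rows, key=lambda r: r[column])


# Two rows copied from the first iostat report in this post.
sample = """\
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.25 0.01 0.49 0.47 4.29 19.08 0.00 2.69 3.01 2.69 0.58 0.03
sdb 0.00 7.20 0.01 6.64 0.58 169.66 51.17 0.02 3.73 10.28 3.72 0.29 0.19
"""

top = busiest(parse_iostat_x(sample))
print(top["device"], top["%util"], top["await"])
```

On the sample rows the highest `%util` is well under 1%, so these particular reports do not show a busy disk. They are averages, though, so they may not reflect what the disks were doing at the moment of the lockup.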
Comments
It looks like a kernel bug. What kernel version are you running on each node of the Vertica cluster?
Hi!
Known Errors in CentOS and RHEL 7: Vertica Impact
[...]
[...]
@rsalayo
Your kernel is: Linux 3.10.0-229.14.1.el7.x86_64