Spread failure
We've had Spread fail a few times over the last few days (running Vertica 9.0.0-2) on AWS Linux:
vertica1 3.10.0-514.6.2.el7.x86_64 #1 SMP Thu Feb 23 03:04:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
The failures appear to be caused by UDP-related network problems. Any help in resolving them would be appreciated.
The first time there were error messages in dmesg like:
[7773872.577752] UDP: bad checksum. From 10.0.4.19:48818 to 10.0.4.119:4803 ulen 1011
[7773913.775550] UDP: bad checksum. From 10.0.4.19:48818 to 10.0.4.119:4803 ulen 846
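For context on what that dmesg line means: the kernel recomputes the UDP checksum over an IPv4 pseudo-header plus the UDP header and payload (RFC 768) and logs "bad checksum" when the result does not match the value carried in the packet. A minimal sketch of that arithmetic in Python (illustrative only, not Vertica or Spread code; the function names are ours):

import struct

def ones_complement_sum(data: bytes) -> int:
    # Sum 16-bit words with end-around carry (one's-complement arithmetic).
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back into the low 16 bits
    return total

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    # Pseudo-header: source IP, destination IP, zero byte, protocol 17 (UDP), UDP length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    # The checksum field itself (bytes 6-7 of the UDP header) is treated as zero while summing.
    segment = udp_segment[:6] + b"\x00\x00" + udp_segment[8:]
    csum = 0xFFFF - ones_complement_sum(pseudo + segment)
    return csum if csum != 0 else 0xFFFF  # UDP transmits all-ones in place of zero

A receiver drops the datagram and logs the line above whenever this value disagrees with the checksum in the packet, which is what is happening for traffic from 10.0.4.19 to 10.0.4.119:4803.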
The second time, an assert was hit that caused a panic in Vertica:
2018-01-04 19:52:21.353 Spread Client:7f95b2ffd700 @v_verticadb_node0001: VX001/5422: VAssert(lzorc == 0 && lzo_len == slen) failed
LOCATION: void decompress(Basics::ByteBuffer&, Basics::ByteBuffer&), /scratch_a/release/svrtar21996/vbuild/vertica/Dist/VSpread.cpp:154
Backtrace Requested
(_ZN6Basics9Backtrace11DoBacktraceEiiPvS1_+0x89b) [0x36d84b5]
(_ZN6Basics20GlobalSignalHandlers20logBacktraceFromHereEb+0xc7) [0x3729227]
(errthrow+0x4f) [0x2c985f9]
(vassert_internal_error+0x0) [0x3760f0e]
(_ZL10decompressRN6Basics10ByteBufferES1_+0x396) [0x5e2178]
(_ZN4Dist7VSpread21processDistCallPacketEiSsSs+0x599) [0x5edd2f]
(_ZN4Dist7VSpread20handleRegularMessageEsi+0x5af) [0x5fc1b9]
(_ZN4Dist7VSpread13staticDequeueEv+0x9a0) [0x60541a]
(_ZN7Session13ThreadManager12launchThreadERKN5boost9function0IvEEPKc+0x41a) [0x359a0d4]
(thread_proxy+0x2f) [0x40b5bef]
(start_thread+0xc5) [0x7f96aab55e25]
(clone+0x6d) [0x7f96aa47734d]
END BACKTRACE
THREAD CONTEXT
Thread type: Spread Thread
Request: Unknown request
Transaction: [0x00a000000495be16]
END THREAD CONTEXT
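The VAssert that fired is a sanity check in Vertica's Spread message path: after LZO-decompressing a message body, it requires that the decompressor reported success (lzorc == 0) and that the decompressed length matches the length recorded with the message (lzo_len == slen, presumably the expected size). We don't have the VSpread.cpp source, but the logic is roughly the following, sketched in Python with zlib standing in for LZO and with made-up names:

import zlib

def decompress_checked(payload: bytes, expected_len: int) -> bytes:
    # zlib stands in for the LZO codec the real code uses; names are illustrative only.
    try:
        out = zlib.decompress(payload)
        rc = 0
    except zlib.error:
        out, rc = b"", -1
    # The condition the VAssert enforces: the codec succeeded AND the
    # decompressed size matches the size recorded for the message.
    if rc != 0 or len(out) != expected_len:
        raise ValueError("corrupted message: decompression failed or length mismatch")
    return out

In other words, the assert fires when the decompressed payload no longer matches what the sender compressed, which lines up with the checksum errors seen on the wire.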
Comments
Interesting. It looks like UDP packets are getting corrupted by the network. The assert failure happened because a corrupted packet was delivered to Vertica anyway, and Vertica detected the corruption when decompressing the message (we assume that the lower-level checksums in Spread and the transport layer normally keep mistransmitted messages from getting through).
I would consider running some sort of UDP traffic analyzer on your cluster to see what is happening.
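If you want something more targeted than a full traffic analyzer, a small sniffer that recomputes UDP checksums and flags mismatches would do. For example, with scapy (assuming scapy is installed and that Spread is on port 4803, as the dmesg lines suggest; this is just a sketch, not a supported tool):

from scapy.all import IP, UDP, raw, sniff

def bad_udp_checksum(pkt) -> bool:
    # Recompute the UDP checksum and compare it with the one on the wire.
    if IP not in pkt or UDP not in pkt:
        return False
    received = pkt[UDP].chksum
    rebuilt = IP(raw(pkt[IP]))   # work on a copy of the captured packet
    del rebuilt[UDP].chksum      # force scapy to recompute the checksum on rebuild
    rebuilt = IP(raw(rebuilt))
    return rebuilt[UDP].chksum != received

# Watch Spread traffic (port 4803 in the dmesg output) and report any mismatches.
sniff(filter="udp port 4803", store=False,
      prn=lambda p: print("bad checksum:", p.summary()) if bad_udp_checksum(p) else None)

Note that if you capture on the sending host, checksum offload on the NIC can make outgoing packets look bad even when they are fine on the wire, so run this on the receiving node (or temporarily disable checksum offload with ethtool while testing).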