Spread failure — Vertica Forum

Spread has failed a few times in the last few days. We are running Vertica 9.0.0-2 on AWS. Kernel (uname -a output): Linux vertica1 3.10.0-514.6.2.el7.x86_64 #1 SMP Thu Feb 23 03:04:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

The failures appear to be caused by UDP-related network problems. Any help resolving them would be appreciated.

The first time, dmesg showed error messages like:
[7773872.577752] UDP: bad checksum. From 10.0.4.19:48818 to 10.0.4.119:4803 ulen 1011
[7773913.775550] UDP: bad checksum. From 10.0.4.19:48818 to 10.0.4.119:4803 ulen 846
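
To see which host pairs are affected, kernel lines like the two above can be tallied per source and destination. This is a minimal Python sketch, assuming the message format is exactly as in these two samples; the function name `tally_bad_checksums` is hypothetical, not part of any Vertica or Spread tooling.

```python
import re
from collections import Counter

# Pattern inferred only from the two sample lines in this post (an assumption).
BAD_CSUM = re.compile(
    r"UDP: bad checksum\. From (?P<src>[\d.]+):(?P<sport>\d+) "
    r"to (?P<dst>[\d.]+):(?P<dport>\d+) ulen (?P<ulen>\d+)"
)


def tally_bad_checksums(lines):
    """Count bad-checksum reports per (source IP, destination IP, destination port)."""
    counts = Counter()
    for line in lines:
        match = BAD_CSUM.search(line)
        if match:
            key = (match["src"], match["dst"], int(match["dport"]))
            counts[key] += 1
    return counts


# The two dmesg lines quoted in this post, used as sample input.
sample = [
    "[7773872.577752] UDP: bad checksum. From 10.0.4.19:48818 to 10.0.4.119:4803 ulen 1011",
    "[7773913.775550] UDP: bad checksum. From 10.0.4.19:48818 to 10.0.4.119:4803 ulen 846",
]
for (src, dst, dport), n in tally_bad_checksums(sample).most_common():
    print(n, src, dst, dport)
```

On a live node the same function could be fed the lines of saved dmesg output instead of the sample list. If one source host dominates the tally, that points at that host or its network path rather than at the receiver.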

The second time, an assert was hit that caused a panic in Vertica:
2018-01-04 19:52:21.353 Spread Client:7f95b2ffd700 @v_verticadb_node0001: VX001/5422: VAssert(lzorc == 0 && lzo_len == slen) failed
LOCATION: void decompress(Basics::ByteBuffer&, Basics::ByteBuffer&), /scratch_a/release/svrtar21996/vbuild/vertica/Dist/VSpread.cpp:154

Backtrace Requested
(_ZN6Basics9Backtrace11DoBacktraceEiiPvS1_+0x89b) [0x36d84b5]
(_ZN6Basics20GlobalSignalHandlers20logBacktraceFromHereEb+0xc7) [0x3729227]
(errthrow+0x4f) [0x2c985f9]
(vassert_internal_error+0x0) [0x3760f0e]
(_ZL10decompressRN6Basics10ByteBufferES1_+0x396) [0x5e2178]
(_ZN4Dist7VSpread21processDistCallPacketEiSsSs+0x599) [0x5edd2f]
(_ZN4Dist7VSpread20handleRegularMessageEsi+0x5af) [0x5fc1b9]
(_ZN4Dist7VSpread13staticDequeueEv+0x9a0) [0x60541a]
(_ZN7Session13ThreadManager12launchThreadERKN5boost9function0IvEEPKc+0x41a) [0x359a0d4]
(thread_proxy+0x2f) [0x40b5bef]
(start_thread+0xc5) [0x7f96aab55e25]
(clone+0x6d) [0x7f96aa47734d]
END BACKTRACE
THREAD CONTEXT
Thread type: Spread Thread
Request: Unknown request
Transaction: [0x00a000000495be16]
END THREAD CONTEXT

Comments

  • Interesting. It looks like UDP packets are being corrupted by the network. The assert failed because a corrupted packet was successfully delivered to Vertica, which detected the corruption only when decompressing the message (we assume that checksums at lower levels, in Spread and the transport layer, keep corrupted messages from being delivered).

    I would consider running some sort of UDP traffic analyzer on your cluster to see what is happening.
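
As a complement to a packet capture, one way to check whether corrupted datagrams actually reach applications is to send UDP probes that carry their own checksum and verify them on the receiving node. The following is only a sketch under assumptions: the probe format, the function names, and the idea of using a spare port are all hypothetical and unrelated to Spread's own wire protocol.

```python
import os
import socket
import struct
import zlib

# Hypothetical probe header: sequence number and CRC32 of the payload.
HEADER = struct.Struct("!II")


def make_probe(seq, payload):
    """Build a datagram whose payload is protected by an application-level CRC32."""
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def check_probe(datagram):
    """Return (seq, ok); ok is False when the payload does not match its CRC32."""
    if len(datagram) < HEADER.size:
        return (None, False)
    seq, crc = HEADER.unpack_from(datagram)
    return (seq, zlib.crc32(datagram[HEADER.size:]) == crc)


def run_sender(host, port, count=1000, size=1000):
    """Send `count` probes with random payloads of `size` bytes to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(count):
            sock.sendto(make_probe(seq, os.urandom(size)), (host, port))


def run_receiver(port):
    """Listen on `port` forever and report every probe that fails verification."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        while True:
            datagram, addr = sock.recvfrom(65535)
            seq, ok = check_probe(datagram)
            if not ok:
                print("corrupted probe", seq, "from", addr)
```

Run `run_receiver` on one node and `run_sender` on another, using an unused port rather than the Spread port. Datagrams that fail the UDP checksum are dropped by the kernel before the receiver sees them, so any probe reported here was corrupted in a way the UDP checksum did not catch, which is the situation described above.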
