DB unresponsive after COPY
Vertica community edition v 7.1.1
When a (150GB) file is loaded in vsql (either with or without DIRECT directive) on a single node cluster an error occurs after all records have been loaded.
COPY sfkdba.fact_verstrekking_apotheek FROM '/tmp/fifo' DELIMITER E'\t';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
After that admintool, the management center and vsql won't connect to the DB. What could be the cause of this error? What are basic trouble shooting steps?
When a (150GB) file is loaded in vsql (either with or without DIRECT directive) on a single node cluster an error occurs after all records have been loaded.
COPY sfkdba.fact_verstrekking_apotheek FROM '/tmp/fifo' DELIMITER E'\t';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
After that admintool, the management center and vsql won't connect to the DB. What could be the cause of this error? What are basic trouble shooting steps?
0
Comments
Kindly let us know the current status of the db/cluser by providing the output of below command:
$ admintools -t view_cluster.
Also might have to look into vertica.log file (present under the catalog directory path) to check the exact reason behind it.
Regards
Rahul
Thanks for your reply. I am new to Vertica so I'm not sure whether I ran into a bug or did something wrong somewhere.
## Output from admintools -t view_cluster after the process failed:
DB | Host | State
-----+------+-------
sfk | ALL | DOWN
## The vertica.log repeatedly shows messages like:
----
LOCATION: initSession, /scratch_a/release/30493/vbuild/vertica/Session/ClientSession.cpp:436
2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <LOG> @v_sfk_node0001: 00000/2705: Connection received: host=10.14.8.20 port=47170 (connCnt 4)
2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <LOG> @v_sfk_node0001: 00000/4540: Received SSL negotiation startup packet
2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <LOG> @v_sfk_node0001: 00000/4691: Sending SSL negotiation response 'N'
2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <FATAL> @v_sfk_node0001: {SessionRun} 57V03/5785: Cluster Status Request by 10.14.8.20:47170
HINT: Cluster State: sfk
UP: 1 of 1 (v_sfk_node0001)
----
## After the fifo has finished, before the process fails, sorting starts:
dbadmin=> SELECT * FROM load_streams WHERE session_id = 'vertica.knmp.local-26716:0x124';
-[ RECORD 1 ]----------+-------------------------------
session_id | vertica.knmp.local-26716:0x124
transaction_id | 45035996273727737
statement_id | 1
stream_name |
schema_name | sfkdba
table_id | 45035996273740052
table_name | fact_verstrekking_apotheek
load_start | 2014-11-30 19:50:33.559189+
load_duration_ms |
is_executing | t
accepted_row_count | 967401777
rejected_row_count | 0
read_bytes | 0
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 967401777
sorted_row_count | 212451788
sort_complete_percent | 21
## Then the sesssion is closed with an error
LOCATION: initSession, /scratch_a/release/30493/vbuild/vertica/Session/ClientSession.cpp:436
2014-11-30 20:36:02.003 SystemMonitor:0x7f33c0016880 <LOG> @v_sfk_node0001: 00000/5081: Total swap memory used: 3371008
2014-11-30 20:36:02.006 AnalyzeRowCount:0x7f33c0016a00-a00000000059a6 [Txn] <INFO> Begin Txn: a00000000059a6 'getRowCountsForProj'
2014-11-30 20:36:02.012 AnalyzeRowCount:0x7f33c0016a00-a00000000059a6 [Txn] <INFO> Rollback Txn: a00000000059a6 'getRowCountsForProj'
2014-11-30 20:36:02.015 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Txn] <INFO> Begin Txn: a00000000059a7 'do_analyze_row_count'
2014-11-30 20:36:02.018 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Command] <INFO> TMTask: row count analyze - start
2014-11-30 20:36:02.020 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Command] <INFO> TMTask: row count analyze - Done
2014-11-30 20:36:02.020 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Txn] <INFO> Rollback Txn: a00000000059a7 'do_analyze_row_count'
2014-11-30 20:36:02.021 AnalyzeRowCount:0x7f33c0016a00 [Util] <INFO> Task 'AnalyzeRowCount' enabled
2014-11-30 20:36:03.557 EEThread:0x7f33a011c4d0-a00000000058f9 [EE] <INFO> Writing... whole cost 219.27 sec, write cost 180.00 sec (pipe wait 22.73 sec), read cost 39.27 sec
2014-11-30 20:36:03.557 EEThread:0x7f33a011c4d0-a00000000058f9 [EE] <INFO> MergeHeap wait for child task 2.10342 sec,other cost 35.9165 sec.
child #0, wait + update 0sec, wait 0.333197. Total execution cost 15.0639
child #1, wait + update 0sec, wait 0.291. Total execution cost 14.6453
child #2, wait + update 0sec, wait 2.20567. Total execution cost 16.9543
child #3, wait + update 0sec, wait 0.300731. Total execution cost 14.8363
2014-11-30 20:36:21.921 EEcmdq:0x7f32b40a0f80 [SAL] <WARNING> LZO Decompression error: 4294967292 size: 40212
2014-11-30 20:36:21.921 EEcmdq:0x7f32b40a0f80 <PANIC> @v_sfk_node0001: VX001/5422: VAssert(decompressionSucceeded) failed
LOCATION: void EE::DecodeQueue::decodeDecompress(EE::DecodeQueue::WorkLoadInfo&), /scratch_a/release/30493/vbuild/vertica/EE/EEUtil/DecodeQueue.cpp:242
2014-11-30 20:36:21.924 EEcmdq:0x7f32b40a07c0 <PANIC> @v_sfk_node0001: VX001/5422: VAssert("Unknown block type" == 0) failed
LOCATION: static void EE::Decoder::setBlockIntoState(const Basics::Block*, EE::DecodeState&), /scratch_a/release/30493/vbuild/vertica/EE/EncDec/ColumnDecoding.h:129
2014-11-30 20:36:22.180 EEcmdq:0x7f32b40a1cb0 <PANIC> @v_sfk_node0001: VX001/5422: VAssert(icnt >= 0) failed
LOCATION: virtual void EE::DecodeQueue::processQueueItem(EE::InternalCommandQueue::QueueItem*), /scratch_a/release/30493/vbuild/vertica/EE/EEUtil/DecodeQueue.cpp:471
2014-11-30 20:36:22.575 EEcmdq:0x7f32b40c72a0 [Main] <INFO> Handling signal: 11
2014-11-30 20:36:22.575 EEcmdq:0x7f32b40c72a0 [Main] <INFO> Handling signal: 11
2014-11-30 20:36:28.221 EEcmdq:0x7f32b40a0f80 [Main] <PANIC> Wrote backtrace to ErrorReport.txt
### Contents of the file ErrorReport.txt:
BEGIN BACKTRACE
Vertica Backtrace at Sun Nov 30 20:36:28 2014
-------------------------
Vertica Analytic Database v7.1.1-0 $BrandId$
vertica(v7.1.1-0) built by release@build2.verticacorp.com from releases/VER_7_1_RELEASE_BUILD_1_0_20141016@148158 on 'Thu Oct 16 15:55:09 America/New_Y
ork 2014' $BuildId$
00400000-04e37000 r-xp 00000000 08:03 12976444 /opt/vertica/bin/vertica
05036000-05d92000 rw-p 04a36000 08:03 12976444 /opt/vertica/bin/vertica
05d92000-05eab000 rw-p 00000000 00:00 0
067a8000-06c68000 rw-p 00000000 00:00 0 [heap]
7f3459042000-7f345eed3000 r--p 00000000 08:03 7733806 /usr/lib/locale/locale-archive
7f345eed3000-7f345fe18000 r--s 00000000 08:03 12982731 /opt/vertica/share/icu/icudt42l.dat
7f345fe18000-7f346ff78000 rw-p 00000000 00:00 0
7f346ff78000-7f347037a000 rw-p 00000000 00:00 0
7f347037a000-7f3470386000 r-xp 00000000 08:03 9830430 /lib64/libnss_files-2.12.so
7f3470386000-7f3470586000 ---p 0000c000 08:03 9830430 /lib64/libnss_files-2.12.so
7f3470586000-7f3470587000 r--p 0000c000 08:03 9830430 /lib64/libnss_files-2.12.so
7f3470587000-7f3470588000 rw-p 0000d000 08:03 9830430 /lib64/libnss_files-2.12.so
7f3470588000-7f34705a5000 r-xp 00000000 08:03 9830456 /lib64/libtinfo.so.5.7
7f34705a5000-7f34707a5000 ---p 0001d000 08:03 9830456 /lib64/libtinfo.so.5.7
7f34707a5000-7f34707a9000 rw-p 0001d000 08:03 9830456 /lib64/libtinfo.so.5.7
7f34707a9000-7f34707bf000 r-xp 00000000 08:03 9830402 /lib64/libgcc_s-4.4.7-20120601.so.1
7f34707bf000-7f34709be000 ---p 00016000 08:03 9830402 /lib64/libgcc_s-4.4.7-20120601.so.1
7f34709be000-7f34709bf000 rw-p 00015000 08:03 9830402 /lib64/libgcc_s-4.4.7-20120601.so.1
7f34709bf000-7f34709d6000 r-xp 00000000 08:03 9830438 /lib64/libpthread-2.12.so
7f34709d6000-7f3470bd6000 ---p 00017000 08:03 9830438 /lib64/libpthread-2.12.so
7f3470bd6000-7f3470bd7000 r--p 00017000 08:03 9830438 /lib64/libpthread-2.12.so
7f3470bd7000-7f3470bd8000 rw-p 00018000 08:03 9830438 /lib64/libpthread-2.12.so
7f3470bd8000-7f3470bdc000 rw-p 00000000 00:00 0
7f3470bdc000-7f3470bde000 r-xp 00000000 08:03 9830420 /lib64/libdl-2.12.so
7f3470bde000-7f3470dde000 ---p 00002000 08:03 9830420 /lib64/libdl-2.12.so
7f3470dde000-7f3470ddf000 r--p 00002000 08:03 9830420 /lib64/libdl-2.12.so
7f3470ddf000-7f3470de0000 rw-p 00003000 08:03 9830420 /lib64/libdl-2.12.so
7f3470de0000-7f3470f6b000 r-xp 00000000 08:03 9830414 /lib64/libc-2.12.so
7f3470f6b000-7f347116a000 ---p 0018b000 08:03 9830414 /lib64/libc-2.12.so
7f347116a000-7f347116e000 r--p 0018a000 08:03 9830414 /lib64/libc-2.12.so
7f347116e000-7f347116f000 rw-p 0018e000 08:03 9830414 /lib64/libc-2.12.so
7f347116f000-7f3471174000 rw-p 00000000 00:00 0
7f3471174000-7f347117b000 r-xp 00000000 08:03 9830442 /lib64/librt-2.12.so
7f347117b000-7f347137a000 ---p 00007000 08:03 9830442 /lib64/librt-2.12.so
7f347137a000-7f347137b000 r--p 00006000 08:03 9830442 /lib64/librt-2.12.so
7f347137b000-7f347137c000 rw-p 00007000 08:03 9830442 /lib64/librt-2.12.so
7f347137c000-7f347139e000 r-xp 00000000 08:03 9830452 /lib64/libncurses.so.5.7
7f347139e000-7f347159d000 ---p 00022000 08:03 9830452 /lib64/libncurses.so.5.7
7f347159d000-7f347159e000 rw-p 00021000 08:03 9830452 /lib64/libncurses.so.5.7
7f347159e000-7f34715be000 r-xp 00000000 08:03 9830407 /lib64/ld-2.12.so
7f34717b3000-7f34717b8000 rw-p 00000000 00:00 0
7f34717bc000-7f34717bd000 rw-p 00000000 00:00 0
7f34717bd000-7f34717be000 r--p 0001f000 08:03 9830407 /lib64/ld-2.12.so
7f34717be000-7f34717bf000 rw-p 00020000 08:03 9830407 /lib64/ld-2.12.so
7f34717bf000-7f34717c0000 rw-p 00000000 00:00 0
7fffb1cad000-7fffb1cc3000 rw-p 00000000 00:00 0 [stack]
7fffb1cca000-7fffb1ccb000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Backtrace Requested
Take a look on these lines: Your table data is corrupted.
Regards, idiot.
The data isn't corrupt going in. I am sure because the data does load without a problem to multiple other DB systems. Any chance on narrowing down the cause of corruption with information available?
Paul
This isn't caused by data corruption in the data being loaded. This error means that Vertica internally wrote some bytes to disk; then it read those same bytes back (without modifying them) and got something back that wasn't what it wrote.
(If you simply load a corrupt data file with COPY, Vertica will either error out at load-time and/or succeed but with rejections output to flag corrupt rows, depending on the options you invoke COPY with.)
This is typically an issue with your filesystem or your physical hard drive. (If it's not a physical hard drive, odds are that's your issue. We're very good at tripping over bugs in network filesystems...) You could try running a filesystem check on the device; see what turns up.
Adam
While its LZO compression - is any way to disable CRC validation for LZO? May be by disabling it it will recover a most of data?
Thanks for your help. fsck reported that the fs was clean. To safe time, eager to do some performance testing, I chose to take the windows way out and reinstalled on a newly created (VM) disk. Loading and sorting the data was performed without any problem after that.
Paul
Unfortunately, this particular error generally means that the LZO data structures themselves have been corrupted. So there's not much hope of recovering that particular block of data... If you do a "SELECT * FROM t;" on the affected table, that should stream results up to the corrupt block.
Thanks.
J.P. van der Linden
Can you post your Vertica host details: is it VM and type of VM?
(don't use in Hyper-V, its buggy with Linux) So run SMART check and check for relocated sectors. Its also can be a reason.
Please post your HDD health. If SMART reports a lot of relocated sectors - so change your HDD.
http://en.wikipedia.org/wiki/S.M.A.R.T. PS
VM - do a hard usage in hdd, so don't use VM with SSD disks, otherwise it will kill your SSD disk.
The VM host used is Proxmox (qemu). Initially, when the error occurred, the disk was assigned to the SCSI bus. This disk is no longer available so SMART report can not be posted. In the second install I used VIRTIO.
Paul