DB unresponsive after COPY

Vertica community edition v 7.1.1

When a (150GB) file is loaded in vsql (either with or without DIRECT directive) on a single node cluster an error occurs after all records have been loaded.

COPY sfkdba.fact_verstrekking_apotheek FROM '/tmp/fifo' DELIMITER E'\t';
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

After that admintool, the management center and vsql won't connect to the DB. What could be the cause of this error? What are basic trouble shooting steps?

Comments

  • Hi J.P

    Kindly let us know the current status of the db/cluser by providing the output of below command:

    $ admintools -t view_cluster.

    Also might have to look into vertica.log file (present under the catalog directory path) to check the exact reason behind it.

    Regards
    Rahul
  • Hello Rahul,

    Thanks for your reply. I am new to Vertica so I'm not sure whether I ran into a bug or did something wrong somewhere.

    ## Output from admintools -t view_cluster after the process failed:

     DB  | Host | State
    -----+------+-------
     sfk | ALL  | DOWN 


    ## The vertica.log repeatedly shows messages like:
                    ----
            LOCATION:  initSession, /scratch_a/release/30493/vbuild/vertica/Session/ClientSession.cpp:436
    2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <LOG> @v_sfk_node0001: 00000/2705: Connection received: host=10.14.8.20 port=47170 (connCnt 4)
    2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <LOG> @v_sfk_node0001: 00000/4540: Received SSL negotiation startup packet
    2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <LOG> @v_sfk_node0001: 00000/4691: Sending SSL negotiation response 'N'
    2014-11-30 20:35:36.091 Init Session:0x7f33b00116a0 <FATAL> @v_sfk_node0001: {SessionRun} 57V03/5785: Cluster Status Request by 10.14.8.20:47170
            HINT:  Cluster State: sfk
                    UP: 1 of 1 (v_sfk_node0001)
                    ----

    ## After the fifo has finished, before the process fails, sorting starts:
    dbadmin=> SELECT * FROM load_streams WHERE session_id = 'vertica.knmp.local-26716:0x124';
    -[ RECORD 1 ]----------+-------------------------------
    session_id             | vertica.knmp.local-26716:0x124
    transaction_id         | 45035996273727737
    statement_id           | 1
    stream_name            |
    schema_name            | sfkdba
    table_id               | 45035996273740052
    table_name             | fact_verstrekking_apotheek
    load_start             | 2014-11-30 19:50:33.559189+
    load_duration_ms       |
    is_executing           | t
    accepted_row_count     | 967401777
    rejected_row_count     | 0
    read_bytes             | 0
    input_file_size_bytes  | 0
    parse_complete_percent | 0
    unsorted_row_count     | 967401777
    sorted_row_count       | 212451788
    sort_complete_percent  | 21


    ## Then the sesssion is closed with an error
            LOCATION:  initSession, /scratch_a/release/30493/vbuild/vertica/Session/ClientSession.cpp:436
    2014-11-30 20:36:02.003 SystemMonitor:0x7f33c0016880 <LOG> @v_sfk_node0001: 00000/5081: Total swap memory used: 3371008
    2014-11-30 20:36:02.006 AnalyzeRowCount:0x7f33c0016a00-a00000000059a6 [Txn] <INFO> Begin Txn: a00000000059a6 'getRowCountsForProj'
    2014-11-30 20:36:02.012 AnalyzeRowCount:0x7f33c0016a00-a00000000059a6 [Txn] <INFO> Rollback Txn: a00000000059a6 'getRowCountsForProj'
    2014-11-30 20:36:02.015 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Txn] <INFO> Begin Txn: a00000000059a7 'do_analyze_row_count'
    2014-11-30 20:36:02.018 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Command] <INFO> TMTask: row count analyze - start
    2014-11-30 20:36:02.020 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Command] <INFO> TMTask: row count analyze - Done
    2014-11-30 20:36:02.020 AnalyzeRowCount:0x7f33c0016a00-a00000000059a7 [Txn] <INFO> Rollback Txn: a00000000059a7 'do_analyze_row_count'
    2014-11-30 20:36:02.021 AnalyzeRowCount:0x7f33c0016a00 [Util] <INFO> Task 'AnalyzeRowCount' enabled
    2014-11-30 20:36:03.557 EEThread:0x7f33a011c4d0-a00000000058f9 [EE] <INFO> Writing... whole cost 219.27 sec, write cost 180.00 sec (pipe wait 22.73 sec), read cost 39.27 sec
    2014-11-30 20:36:03.557 EEThread:0x7f33a011c4d0-a00000000058f9 [EE] <INFO> MergeHeap wait for child task 2.10342 sec,other cost 35.9165 sec.
      child #0, wait + update 0sec, wait 0.333197. Total execution cost 15.0639
      child #1, wait + update 0sec, wait 0.291. Total execution cost 14.6453
      child #2, wait + update 0sec, wait 2.20567. Total execution cost 16.9543
      child #3, wait + update 0sec, wait 0.300731. Total execution cost 14.8363
    2014-11-30 20:36:21.921 EEcmdq:0x7f32b40a0f80 [SAL] <WARNING> LZO Decompression error: 4294967292 size: 40212
    2014-11-30 20:36:21.921 EEcmdq:0x7f32b40a0f80 <PANIC> @v_sfk_node0001: VX001/5422: VAssert(decompressionSucceeded) failed
            LOCATION:  void EE::DecodeQueue::decodeDecompress(EE::DecodeQueue::WorkLoadInfo&), /scratch_a/release/30493/vbuild/vertica/EE/EEUtil/DecodeQueue.cpp:242
    2014-11-30 20:36:21.924 EEcmdq:0x7f32b40a07c0 <PANIC> @v_sfk_node0001: VX001/5422: VAssert("Unknown block type" == 0) failed
            LOCATION:  static void EE::Decoder::setBlockIntoState(const Basics::Block*, EE::DecodeState&), /scratch_a/release/30493/vbuild/vertica/EE/EncDec/ColumnDecoding.h:129
    2014-11-30 20:36:22.180 EEcmdq:0x7f32b40a1cb0 <PANIC> @v_sfk_node0001: VX001/5422: VAssert(icnt >= 0) failed
            LOCATION:  virtual void EE::DecodeQueue::processQueueItem(EE::InternalCommandQueue::QueueItem*), /scratch_a/release/30493/vbuild/vertica/EE/EEUtil/DecodeQueue.cpp:471
    2014-11-30 20:36:22.575 EEcmdq:0x7f32b40c72a0 [Main] <INFO> Handling signal: 11
    2014-11-30 20:36:22.575 EEcmdq:0x7f32b40c72a0 [Main] <INFO> Handling signal: 11
    2014-11-30 20:36:28.221 EEcmdq:0x7f32b40a0f80 [Main] <PANIC> Wrote backtrace to ErrorReport.txt


    ### Contents of the file ErrorReport.txt:

    BEGIN BACKTRACE
    Vertica Backtrace at Sun Nov 30 20:36:28 2014
    -------------------------
    Vertica Analytic Database v7.1.1-0 $BrandId$
    vertica(v7.1.1-0) built by release@build2.verticacorp.com from releases/VER_7_1_RELEASE_BUILD_1_0_20141016@148158 on 'Thu Oct 16 15:55:09 America/New_Y
    ork 2014' $BuildId$
    00400000-04e37000 r-xp 00000000 08:03 12976444                           /opt/vertica/bin/vertica
    05036000-05d92000 rw-p 04a36000 08:03 12976444                           /opt/vertica/bin/vertica
    05d92000-05eab000 rw-p 00000000 00:00 0
    067a8000-06c68000 rw-p 00000000 00:00 0                                  [heap]
    7f3459042000-7f345eed3000 r--p 00000000 08:03 7733806                    /usr/lib/locale/locale-archive
    7f345eed3000-7f345fe18000 r--s 00000000 08:03 12982731                   /opt/vertica/share/icu/icudt42l.dat
    7f345fe18000-7f346ff78000 rw-p 00000000 00:00 0
    7f346ff78000-7f347037a000 rw-p 00000000 00:00 0
    7f347037a000-7f3470386000 r-xp 00000000 08:03 9830430                    /lib64/libnss_files-2.12.so
    7f3470386000-7f3470586000 ---p 0000c000 08:03 9830430                    /lib64/libnss_files-2.12.so
    7f3470586000-7f3470587000 r--p 0000c000 08:03 9830430                    /lib64/libnss_files-2.12.so
    7f3470587000-7f3470588000 rw-p 0000d000 08:03 9830430                    /lib64/libnss_files-2.12.so
    7f3470588000-7f34705a5000 r-xp 00000000 08:03 9830456                    /lib64/libtinfo.so.5.7
    7f34705a5000-7f34707a5000 ---p 0001d000 08:03 9830456                    /lib64/libtinfo.so.5.7
    7f34707a5000-7f34707a9000 rw-p 0001d000 08:03 9830456                    /lib64/libtinfo.so.5.7
    7f34707a9000-7f34707bf000 r-xp 00000000 08:03 9830402                    /lib64/libgcc_s-4.4.7-20120601.so.1
    7f34707bf000-7f34709be000 ---p 00016000 08:03 9830402                    /lib64/libgcc_s-4.4.7-20120601.so.1
    7f34709be000-7f34709bf000 rw-p 00015000 08:03 9830402                    /lib64/libgcc_s-4.4.7-20120601.so.1
    7f34709bf000-7f34709d6000 r-xp 00000000 08:03 9830438                    /lib64/libpthread-2.12.so
    7f34709d6000-7f3470bd6000 ---p 00017000 08:03 9830438                    /lib64/libpthread-2.12.so
    7f3470bd6000-7f3470bd7000 r--p 00017000 08:03 9830438                    /lib64/libpthread-2.12.so
    7f3470bd7000-7f3470bd8000 rw-p 00018000 08:03 9830438                    /lib64/libpthread-2.12.so
    7f3470bd8000-7f3470bdc000 rw-p 00000000 00:00 0
    7f3470bdc000-7f3470bde000 r-xp 00000000 08:03 9830420                    /lib64/libdl-2.12.so
    7f3470bde000-7f3470dde000 ---p 00002000 08:03 9830420                    /lib64/libdl-2.12.so
    7f3470dde000-7f3470ddf000 r--p 00002000 08:03 9830420                    /lib64/libdl-2.12.so
    7f3470ddf000-7f3470de0000 rw-p 00003000 08:03 9830420                    /lib64/libdl-2.12.so
    7f3470de0000-7f3470f6b000 r-xp 00000000 08:03 9830414                    /lib64/libc-2.12.so
    7f3470f6b000-7f347116a000 ---p 0018b000 08:03 9830414                    /lib64/libc-2.12.so
    7f347116a000-7f347116e000 r--p 0018a000 08:03 9830414                    /lib64/libc-2.12.so
    7f347116e000-7f347116f000 rw-p 0018e000 08:03 9830414                    /lib64/libc-2.12.so
    7f347116f000-7f3471174000 rw-p 00000000 00:00 0
    7f3471174000-7f347117b000 r-xp 00000000 08:03 9830442                    /lib64/librt-2.12.so
    7f347117b000-7f347137a000 ---p 00007000 08:03 9830442                    /lib64/librt-2.12.so
    7f347137a000-7f347137b000 r--p 00006000 08:03 9830442                    /lib64/librt-2.12.so
    7f347137b000-7f347137c000 rw-p 00007000 08:03 9830442                    /lib64/librt-2.12.so
    7f347137c000-7f347139e000 r-xp 00000000 08:03 9830452                    /lib64/libncurses.so.5.7
    7f347139e000-7f347159d000 ---p 00022000 08:03 9830452                    /lib64/libncurses.so.5.7
    7f347159d000-7f347159e000 rw-p 00021000 08:03 9830452                    /lib64/libncurses.so.5.7
    7f347159e000-7f34715be000 r-xp 00000000 08:03 9830407                    /lib64/ld-2.12.so
    7f34717b3000-7f34717b8000 rw-p 00000000 00:00 0
    7f34717bc000-7f34717bd000 rw-p 00000000 00:00 0
    7f34717bd000-7f34717be000 r--p 0001f000 08:03 9830407                    /lib64/ld-2.12.so
    7f34717be000-7f34717bf000 rw-p 00020000 08:03 9830407                    /lib64/ld-2.12.so
    7f34717bf000-7f34717c0000 rw-p 00000000 00:00 0
    7fffb1cad000-7fffb1cc3000 rw-p 00000000 00:00 0                          [stack]
    7fffb1cca000-7fffb1ccb000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

    Backtrace Requested
  • Hi!

    Take a look on these lines:
    2014-11-30 20:36:21.921 EEcmdq:0x7f32b40a0f80 [SAL] <WARNING> LZO Decompression error:

    ...
    2014-11-30 20:36:21.921 EEcmdq:0x7f32b40a0f80 <PANIC> @v_sfk_node0001: VX001/5422: VAssert(decompressionSucceeded) failed
    Your table data is corrupted.

    Regards, idiot.
  • Thanks you Id10t,

    The data isn't corrupt going in. I am sure because the data does load without a problem to multiple other DB systems. Any chance on narrowing down the cause of corruption with information available?

    Paul
  • Hi Paul,

    This isn't caused by data corruption in the data being loaded.  This error means that Vertica internally wrote some bytes to disk; then it read those same bytes back (without modifying them) and got something back that wasn't what it wrote.

    (If you simply load a corrupt data file with COPY, Vertica will either error out at load-time and/or succeed but with rejections output to flag corrupt rows, depending on the options you invoke COPY with.)

    This is typically an issue with your filesystem or your physical hard drive.  (If it's not a physical hard drive, odds are that's your issue.  We're very good at tripping over bugs in network filesystems...)  You could try running a filesystem check on the device; see what turns up.

    Adam
  • Hi!
    The data isn't corrupt going in.
    I didn't say that data is corrupted, I said that table data is corrupted, i.e. data on HDD (not Vertica ROS containers, but data)
    This is typically an issue with your filesystem or your physical hard drive.
    Agree with Adam, run fsck and define filesystem checks at reboot time in fstab:
    man fstab
    The sixth field (fs_passno).
    This field is used by the fsck(8) program to determine the order in which filesystem checks are done at reboot time. The root filesystem should be specified with a fs_passno of 1, and other filesystems should have a fs_passno of 2. Filesystems within a drive will be checked sequentially, but filesystems on different drives will be checked at the same time to utilize parallelism available in the hardware. If the sixth field is not present or zero, a value of zero is returned and fsck will assume that the filesystem does not need to be checked.
  • @adam

    While its LZO compression - is any way to disable CRC validation for LZO? May be by disabling it it will recover a most of data?
  • Hello Adam, Id10t,

    Thanks for your help. fsck reported that the fs was clean. To safe time, eager to do some performance testing, I chose to take the windows way out and reinstalled on a newly created (VM) disk. Loading and sorting the data was performed without any problem after that.

    Paul
  • @id10t

    Unfortunately, this particular error generally means that the LZO data structures themselves have been corrupted.  So there's not much hope of recovering that particular block of data...  If you do a "SELECT * FROM t;" on the affected table, that should stream results up to the corrupt block.
  • @adam
    Thanks.

    J.P. van der Linden
    Can you post your Vertica host details: is it VM and type of VM?
    (don't use in Hyper-V, its buggy with Linux)
    fsck reported that the fs was clean
    So run SMART check and check for relocated sectors. Its also can be a reason.
    Please post your HDD health. If SMART reports a lot of relocated sectors - so change your HDD.

    http://en.wikipedia.org/wiki/S.M.A.R.T.
    S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system for computer hard disk drives (HDDs) and solid-state drives (SSDs)[1] that detects and reports on various indicators of drive reliability, aiming to anticipate hardware failures.
    PS
    VM - do a hard usage in hdd, so don't use VM with SSD disks, otherwise it will kill your SSD disk.
  • Hello Id10t,

    The VM host used is Proxmox (qemu). Initially, when the error occurred, the disk was assigned to the SCSI bus. This disk is no longer available so SMART report can not be posted. In the second install I used VIRTIO.

    Paul
  • no disk - no comments.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file