Mergeout: Read failed in FileColumnReader

Hi,
Mergeout intermittently fails on one of the nodes with the error below:
2018-01-04 20:09:20.189 TM Mergeout(00):0x7f50e4027140-2d0000000c84a76 [Txn] Rollback Txn: 2d0000000c84a76 'Mergeout: (Table: cdvrap_usr.TableA) (Projection: cdvrap_usr.Table_pr1)'
2018-01-04 20:09:20.195 TM Mergeout(00):0x7f50e4027140 @v_svcrap_node0036: 00000/3298: Event Posted: Event Code:14 Event Id:786949 Event Severity: Warning [4] PostedTimestamp: 2018-01-04 20:09:20.195081 ExpirationTimestamp: 2018-01-04 20:09:35.195081 EventCodeDescription: Timer Service Task Error ProblemDescription: threadShim: Read failed in FileColumnReader: /mnt/verticadata/svcrap/v_svcrap_node0036_data/315/02457d3afc004d8511c41528c65199d802d0000184d653eb_0.gt Input/output error DatabaseName: svcrap
2018-01-04 20:09:20.195 TM Mergeout(00):0x7f50e4027140 @v_svcrap_node0036: {threadShim} 58030/4520: Read failed in FileColumnReader: /mnt/verticadata/svcrap/v_svcrap_node0036_data/315/02457d3afc004d8511c41528c65199d802d0000184d653eb_0.gt Input/output error
LOCATION: getBlockWithPrefetch, /scratch_a/release/svrtar21996/vbuild/vertica/SAL/FileColumnReader.cpp:242

The I/O error is reported for this file:
02457d3afc004d8511c41528c65199d802d0000184d653eb_0.gt

Please let me know how to resolve this issue.

Comments

  • Does the file exist? What file size does it have? Is the file size close to the result of
    select sum(size) from v_internal.vs_ros where salstorageid = '02457d3afc004d8511c41528c65199d802d0000184d653eb';

    Can you actually read the file manually? For example,
    md5sum /mnt/verticadata/svcrap/v_svcrap_node0036_data/315/02457d3afc004d8511c41528c65199d802d0000184d653eb_0.gt

    It's possible the file is inaccessible due to a bad sector on disk. If so, if you have it in a backup or on another node, you can copy it over. Or you could delete the file and restart the node, counting on Vertica's recovery process to repair the disk contents (obviously, this mechanism is more dangerous than the previous suggestions - ensure you have a hard-link-local backup before trying).
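
The checks above (does the file exist, how large is it, can it be read end to end) can be sketched as one small shell function. This is only an illustration: `check_file` is a hypothetical helper name, and it uses the POSIX `cksum` utility instead of `md5sum` purely because either one has to read every byte of the file.

```shell
# check_file PATH
# Prints whether PATH exists, its size in bytes, and whether a full read works.
# Returns 0 if readable, 1 if missing, 2 if the read fails (e.g. an I/O error).
check_file() {
    cf_path=$1
    if [ ! -e "$cf_path" ]; then
        echo "missing: $cf_path"
        return 1
    fi
    # Byte count of the file; tr strips the padding some wc versions add.
    cf_size=$(wc -c < "$cf_path" 2>/dev/null | tr -d ' ')
    # cksum must read the whole file, so a bad sector surfaces as a failure.
    if cksum "$cf_path" > /dev/null 2>&1; then
        echo "readable: $cf_path ($cf_size bytes)"
    else
        echo "read failed: $cf_path ($cf_size bytes)"
        return 2
    fi
}
```

Run it as `check_file /path/to/file_0.gt` and compare the reported byte count with the result of the `vs_ros` query.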

  • Hi Ben,
    Yes, the file exists at that path. The size returned by the query is 3995284, and the file size on disk is 3.82 MB.

    I don't have a backup of this file. If I delete the file and restart the node, is there a possibility that I will lose the data in that file?

    Do I need to take a backup of the node before restarting?

    Will it become an issue if I leave the file like that?
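
For comparing the two size figures, a byte-exact check is less ambiguous than a rounded "MB" value. The sketch below assumes the query result is a byte count; `bytes_to_mib` and `size_matches` are hypothetical helper names.

```shell
# bytes_to_mib N: convert a byte count to MiB (1 MiB = 1048576 bytes), 2 decimals.
bytes_to_mib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

# size_matches FILE EXPECTED_BYTES: succeed only if the on-disk size is exact.
size_matches() {
    sm_actual=$(wc -c < "$1" | tr -d ' ')
    [ "$sm_actual" -eq "$2" ]
}
```

`bytes_to_mib 3995284` gives 3.81, so the figure from the query and the reported file size are close. `size_matches <file> 3995284` would confirm whether they agree exactly.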

  • Can I get an update on this?

  • Jim_Knicely Administrator

    You should be able to figure out which table is impacted:

    select distinct p.projection_schema anchor_table_schema, p.anchor_table_name from storage_containers sc join projections p on p.projection_id = sc.projection_id where right(storage_oid::varchar, 3) = '315' and sal_storage_id = '02457d3afc004d8511c41528c65199d802d0000184d653eb';
    

    Then maybe make a copy of the table using a CTAS statement.

    Then rename the file above rather than deleting it.

    Then run the RESTART_THIS_NODE command on the affected node and hopefully, as Ben stated, Vertica will recover for you.

    Ben, what are your thoughts?

  • If you rename the file, be sure to rename it outside of the data/ directory - otherwise startup may clean it up.
    If the table is small, you can use CTAS to rewrite it and drop the original.
    Are the buddies of node 36 healthy? (likely node 35 and node 37)
    You can create a hard link local backup with:
    select database_snapshot('backup',true);
    This won't use much disk space. After everything is straightened out, remember to clean up the snapshot with:
    select remove_database_snapshot('backup');

    Did you ever try running md5sum or some other program to read the file contents? I'm concerned that if it's a bad disk, you will paper over the problem only to have it reappear later in another file.
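
The advice to move the file outside the data/ directory can be sketched as a guarded move. This is an illustration only: `quarantine_file` is a hypothetical helper, the paths are expected to be absolute and already normalized (no `..` or symlinks), and the quarantine directory should sit on the same filesystem so that `mv` is a plain rename. A move across filesystems has to copy the file and would hit the same read error.

```shell
# quarantine_file FILE DATA_DIR QUARANTINE_DIR
# Moves FILE into QUARANTINE_DIR, refusing if that directory is DATA_DIR
# itself or lies anywhere below it.
quarantine_file() {
    qf_src=$1
    qf_data=${2%/}      # drop one trailing slash, if any
    qf_dest=${3%/}
    case "$qf_dest/" in
        "$qf_data"/*)
            echo "refusing: $qf_dest is inside $qf_data" >&2
            return 1
            ;;
    esac
    mkdir -p "$qf_dest" && mv -- "$qf_src" "$qf_dest/"
}
```

A call might look like `quarantine_file "$file" /mnt/verticadata/svcrap/v_svcrap_node0036_data /mnt/verticadata/quarantine`, where the quarantine path is just an example name.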

  • Jim_Knicely Administrator
    edited January 2018

    FYI: attached is a simple test showing that Vertica can recover a file that has been deleted. Note that the node has to be brought up with the -F or --force option to delete the corrupted data and recover from the cluster.

  • Jim_Knicely Administrator

    Attached is another example like the previous one, except this one uses a segmented table, whereas the first example used a replicated table.

  • Hi Ben / Jim,
    No records are returned by the query below, even though the file exists at the path mentioned above:
    select distinct p.projection_schema anchor_table_schema, p.anchor_table_name from storage_containers sc join projections p on p.projection_id = sc.projection_id where right(storage_oid::varchar, 3) = '315' and sal_storage_id = '02457d3afc004d8511c41528c65199d802d0000184d653eb';

    Does this mean the table linked to the file was deleted?

  • Jim_Knicely Administrator
    edited January 2018

    What is the storage id?

    select p.node_name, p.projection_schema anchor_table_schema, p.anchor_table_name, storage_oid from storage_containers sc join projections p on p.projection_id = sc.projection_id where sal_storage_id = '02457d3afc004d8511c41528c65199d802d0000184d653eb' order by 1;

  • Yes, it exists.

  • Hi Jim,
    The storage ID is 315. But no records are returned by the query above when I pass the file name '02457d3afc004d8511c41528c65199d802d0000184d653eb' as the sal_storage_id.

    But the file exists.

  • And there are no records for the query below either:
    select * from storage_containers WHERE sal_storage_id = '02457d3afc004d8511c41528c65199d802d0000184d653eb';

  • Hi Ben,
    I get an I/O error when I try to read the file:
    md5sum: /mnt/verticadata/svcrap/v_svcrap_node0036_data/315/02457d3afc004d8511c41528c65199d802d0000184d653eb_0.gt: Input/output error

  • Jim_Knicely Administrator
    edited January 2018

    If there are no records with that sal_storage_id in storage_containers, then there should be no tables associated with it.

    If you are getting an I/O error when reading that file, then it's most likely a bad sector on the disk. I'd look to replace that disk ASAP.

    Also, I just noticed the table name in your original post is "cdvrap_usr.TableA" and the projection name is "cdvrap_usr.Table_pr1".

    I'd make a copy of that table right away!

    You should be able to find the storage location info for it:

     select node_name, storage_oid, sal_storage_id, total_row_count from storage_containers where schema_name = 'cdvrap_usr' and projection_name = 'Table_pr1';
    

    See if you can do a manual Moveout:

    SELECT do_tm_task('moveout','cdvrap_usr.TableA');

  • Is it possible to find which tables are affected by the bad sector?
    Once I have the list of tables, I can check whether those tables are in use.
    If a table is not being used and we can truncate it, will that resolve the problem?
    Or is replacing the node the only option? Will restarting the node not work?

  • If the disk is bad, this is a hardware problem, and you should address it as one. Replacing the disk or node and letting Vertica recover the data is a reasonable approach.
    I believe truncating the affected tables without resolving the hardware problem just sets you up for future pain. If you address the hardware issue and recover the node, no truncation is necessary.
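
On the question of which files (and therefore which storage containers) a bad disk affects, one rough approach is to read every file under the data directory and list the ones that fail. The sketch below is hedged accordingly: `list_unreadable` and `storage_id_of` are hypothetical helpers, and the assumption that a file name is the sal_storage_id followed by `_<n>.gt` is inferred only from the single file named in this thread.

```shell
# list_unreadable DIR: print every regular file under DIR that cannot be read
# in full. cksum has to read each byte, so an I/O error shows up as a failure.
list_unreadable() {
    find "$1" -type f | while IFS= read -r lu_file; do
        cksum "$lu_file" > /dev/null 2>&1 || echo "$lu_file"
    done
}

# storage_id_of FILE: strip the directory part and the trailing "_<n>.gt".
storage_id_of() {
    si_name=${1##*/}
    echo "${si_name%_*.gt}"
}
```

Each ID printed by `storage_id_of` could then be looked up in storage_containers with the queries shown earlier in the thread. Note that scanning a whole data directory reads every file once, so it adds significant I/O load on the node.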

  • Got it. Thanks.
