Data file corruption for large table

paulc · June 2017

Our Data Center had a power outage a couple weeks ago which caused corruption in a couple of our databases (1 postgres db and 1 Oracle db). Recently this manifested in our production Analytics data warehouse, which is Vertica.

error_messages.message gives me "Data file may be corrupt. Ensure that all hardware (disk and memory) is working properly. Possible solutions are to delete the file /var/lib/vertica/anly_dw/v_anly_dw_node0002_data/545/49539595941346545/49539595941346545_0.fdb while the node is down, and then allow the node to recover, or truncate the table data"

Truncating and restoring the table is an option, but it might take a couple days. I would prefer the first approach of shutting down the host, deleting the corrupted file and allowing an auto recover upon restart of the host. However, my first attempt at this failed, here is the log entry.

2017-06-12 18:41:07.179 Main:0x5f39b50 @v_anly_dw_node0002: VX001/2973: Data consistency problems found; startup aborted
HINT: Check that all file systems are properly mounted. Also, the --force option can be used to delete corrupted data and recover from the cluster
LOCATION: mainEntryPoint, /scratch_a/release/vbuild/vertica/Basics/vertica.cpp:1346

At this point I am afraid to use --force. This is the command I'm thinking of running: admintools -t restart_node -s [node_ip] -d [db_name] -p [password] -F. Will this really work and re-create the deleted .fdb file from nothing as part of the auto recover?

If this somehow hoses the cluster, I don't have much experience restoring Vertica databases. Would I need to put the original (corrupted) fdb file in it's original location and recover to the last restore point? I need something to fall back on if the --force/-F messes something up, so that concerns me.

My ultimate question is, will restart_node with -F option work if I drop the corrupt file? And if not, can I recover to the last restore point?

I have not looked at any hardware issues. That does not seem to be the problem here, though I'm not absolutely sure.

Thanks. Any help is greatly appreciated.

emoreno · June 2017

how many nodes? What version of Vertica? The force deletes the corrupted data but if you have corrupted data from the same projection in buddy node you may be in trouble. If you are a enterprise customer I recommend to open a support ticket, they can help you to restart the database with the minimum data loss.

RaghavA · June 2017

Hi Paulc,

i have also faced similar situation multiple times due to power failures. if you believe me, please go ahead and use the --force option to start the node and let it Auto-recover the corrupted ROS container.

It is safe and we have never faced any problem in v7.1.x, v7.2.x in using --force option to recover our node.

When power failure occurs, the ROS containers on which vertica must be writing becomes corrupted.

Regards,
Raghav Agrawal

paulc · June 2017

RaghavA - I appreciate your response. Did you have the identical situation as me, where an fdb file on 1 node went bad? Did you simply delete the corrupt file and restart the node with --force, on the command line? Does it recreate the fdb file as part of auto recover?

Sorry about my paranoia, but we are NOT paying for Vertica (3 nodes, Vertica 7.0). If this somehow fails and all data for the table becomes corrupt (as emoreno alluded to), I may have nothing to fall back on. Rebuilding the data warehouse from scratch would not be fun, or good for the business.

At this point our Sys Admin will be doing a hardware check on the box, so I'm not doing anything until that is done.

Thanks.

RaghavA · June 2017

Hi Paul,

if this is a 1-node setup, then it is a problem for you as --force option will not be able to recover the database. Then you have to seek help from vertica support if they have any internal, undocumented procedure to solve your problem.

If this is a 3-Node or bigger cluster having ksafe=1, then --force would be best option.

Yes, i have face the same situation approx 5 times in last 3 months in my 3-Node setup. The best approach would be:

1) Ask your sysadmin, to check the RAID and Disk health on your server.

2) if things are OK, then don't remove the corrupted .fdx file. Just try to start the node using --force option. Vertica will automatically remove corrupted file and recover it using buddy node. During Recovery phase, a new ROS will be created by Vertica.

You can keep a track on startup.log file on impacted node which is under your catalog directory.

Regards,
RaghavA

emoreno · June 2017

When you apply force it does remove the file and when it recovers it will recovery the files from the buddies nodes, but if the buddy node has corrupted data too, it has no place to recover.

It is not clear if the Database is UP or DOWN. If the database is UP you have only problems with that node and could use the force without hesitation.

As an alternative, you could use the CRCcheck option in vertica to check if you have corrupted files in the nodes. Check the doc :

https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/AdministratorsGuide/OperatingTheDatabase/IndexCRC/RunningTheCheckCRCOption.htm?Highlight=CRC

This function was introduced in 7.0 version of Vertica, if your data is coming from an old database that you updated it may not have the CRC index to check for corruption.

Hope this helps.
Eugenia
my 2 cents, for a production database, I recommend to backup the data and have a support subscription.

paulc · June 2017

The database is UP, so I guess that means the buddy node is ok. I will run the CheckCRC option nonetheless. That being said, I'm leaning towards trying --force.

A support subscription is probably not an option. Not sure how much it costs. I am primarily a Postgres DBA and I never needed support for it, but Vertica is a different animal.

It sounds like if I do a backup prior to trying the force I can always fall back to that if it fails. Is this true?

emoreno · June 2017

Run the checkCRC in the buddy nodes to see if the data is corrupted there. If not you could recovery from buddies without problems.

Eugenia

RaghavA · June 2017

Hi Paulc,

Taking a backup with corrupted ROS container is not a good option. But, taking a Hardlink backup could be an option.

But nothing to worry about, --force is safe to do it. Force option just pushes vertica to recover node in case ROS container has inconsistencies.

Regards,
Raghav Agrawal

paulc · June 2017

Force did not remove and re-create the corrupt data file, it left it in it's previous state, as re-verified by CheckCRC. Maybe it's a 7.0 vs 7.1+ thing (I'm using 7.0), or perhaps a configuration issue, but I had to manually delete the data file before using force. It worked, everything checks out.

Thanks for your help.

RaghavA · June 2017

Hi Paulc,

Yeah.. maybe 7.0 thing but gald to hear that your problem has been resolved.

Regards,
RaghavA

Rbaser · October 2017

Any suggestions ?

Data file corruption for large table

Comments

Leave a Comment