Last Good Epoch - RESTORE
I have a clustered environment, and fairly frequently one of the nodes goes down, sometimes crashing the database. To recover the Vertica database it is then necessary to restore it to the Last Good Epoch (which can be some days back), causing a gap in the displayed data. Is there any way to recover the database without losing that data?
PS: I noticed that some large "core.*" files were created on the problematic node:
/data/drdata/v_drdata_node0002_catalog
[root@host v_drdata_node0002_catalog]# ls -lha
total 7.1G
drwx------ 9 dradmin dradmin 4.0K Mar 25 04:02 .
drwxrwxr-x 4 dradmin dradmin 4.0K Mar 25 04:02 ..
drwx------ 4 dradmin dradmin 4.0K Feb 11 18:14 Catalog
drwx------ 2 dradmin dradmin 4.0K Feb 11 18:14 CopyErrorLogs
-rw------- 1 dradmin dradmin 1.9G Mar 24 12:46 core.12425
-rw------- 1 dradmin dradmin 1.9G Mar 24 18:20 core.15732
-rw------- 1 dradmin dradmin 1.9G Mar 24 16:39 core.18424
-rw------- 1 dradmin dradmin 1.9G Mar 24 17:20 core.2524
drwxrwx--- 2 dradmin dradmin 68K Mar 25 11:54 DataCollector
-rw------- 1 dradmin dradmin 2.2K Feb 11 18:14 debug_log.conf
-rwx------ 1 dradmin dradmin 18K Mar 24 18:20 ErrorReport.txt
drwx------ 2 dradmin dradmin 4.0K Feb 11 18:14 Libraries
drwxrwxr-x 2 dradmin dradmin 4.0K Mar 24 01:33 Snapshots
drwx------ 2 dradmin dradmin 4.0K Feb 11 18:14 tmp
drwx------ 2 dradmin dradmin 4.0K Mar 25 04:02 UDxLogs
-rw------- 1 dradmin dradmin 984 Feb 20 10:56 vertica.conf
-rw------- 1 dradmin dradmin 113M Mar 25 11:54 vertica.log
-rw------- 1 dradmin dradmin 14M Mar 25 04:02 vertica.log.1.gz
-rw------- 1 dradmin dradmin 20M Mar 24 04:03 vertica.log.2.gz
-rw------- 1 dradmin dradmin 15M Mar 23 04:02 vertica.log.3.gz
-rw------- 1 dradmin dradmin 45 Mar 24 21:38 vertica.pid
Thanks in advance!
Comments
Hello Márcio,
A node failure should not cause a database failure, and it shouldn't create core files. It seems that something particularly nasty is going on that is causing nodes to fail in a catastrophic manner. I would recommend investigating the failures instead of trying too hard to figure out how to gracefully recover from that scenario.
If for some reason a node failure does cause the database to have to stop, the individual nodes should shut down cleanly, which prevents the need to do "ASR" (restoring to a last good epoch).
Check out the Vertica logs and look for PANIC events.
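For example, using the catalog directory from the listing above (adjust the path for each node), something like this should turn up any such entries in the current and rotated logs:
$ grep -n "PANIC" /data/drdata/v_drdata_node0002_catalog/vertica.log
$ zgrep -n "PANIC" /data/drdata/v_drdata_node0002_catalog/vertica.log.*.gz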
- Derrick
Hi Márcio,
You said you are seeing core files on the nodes that went down.
It may be a serious issue related to corruption of the binaries. Try investigating it further. Restoring to the LGE should be the last option.
Hope this helps.
NC
Hi Derrick/Navin,
One node did not come back after an unexpected power-down of the VM where it runs, so it was necessary to force a restart with the command below:
$admintools -t restart_node -F -s <this_Hostname_or_IP> -d <dbname>
After doing that on the problematic node, all nodes went down, and a message asking me to roll back to the LGE was displayed.
I don't know any other way to restore the Vertica database after it has crashed like this. Is there another way to restore it?
PS: Old data is no longer being displayed on my frontend, "CA Performance Center". Is there any way to make sure the old data remains in the database?
Please, excuse my basic English!
Thanks in advance!
"After doing that on problematic node, all nodes goes down"
Hmm. I don't believe that's intended behavior. We could investigate that further. If you have a technical support contract, I would urge you to use it in this scenario.
Are you able to start the database without that node? You may have to use the 'restart node' option of admintools on every node except that one.
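For example (placeholder hostnames; assuming this admintools version accepts a comma-separated host list for -s), something along these lines would attempt to bring up every node except the problem one:
$ admintools -t restart_node -s <surviving_host_1>,<surviving_host_2> -d drdata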
If that works, then it implies you have k-safety (data redundancy). In that case, you can fully erase the problem node's metadata (catalog), causing it to join the cluster and recover from scratch. You would remove v_dbname_node000x_catalog/Catalog and then start the node. I encourage you to take a full filesystem backup of both _catalog and _data on the problem node before trying this.
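A rough sketch of that procedure, assuming node 2 is the problem node, using the catalog path from the listing above, and guessing the matching _data path (adjust both to your layout); run it only while that node is down and the rest of the cluster is UP:
$ tar czf /tmp/node0002_catalog_backup.tar.gz /data/drdata/v_drdata_node0002_catalog    # backup of the catalog area
$ tar czf /tmp/node0002_data_backup.tar.gz /data/drdata/v_drdata_node0002_data          # backup of the data area (path assumed)
$ rm -rf /data/drdata/v_drdata_node0002_catalog/Catalog                                  # wipe only the Catalog subdirectory
$ admintools -t restart_node -s <problem_node_host> -d drdata                            # node rejoins and recovers from its buddies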
Thoughts?
Unfortunately I don't have a technical support contract with HP. My vendor is "CA Technology", responsible for supporting "Performance Center", so when I have a problem related to the Vertica database, I have to ask CA to open a ticket with HP.
No, it's not possible to restart the database without the problematic node. I have to force a "restart_node" on the node with state "DOWN" to be able to restore the cluster, or stop the database through the admintools option (4) Stop Database.
The K-safety level is set to 1, as you can see below:
drdata=> SELECT MARK_DESIGN_KSAFE(1);
MARK_DESIGN_KSAFE
----------------------
Marked design 1-safe
(1 row)
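As a side check, the marked design level can be compared against what the cluster is actually achieving (assuming the v_monitor.system columns below exist in this Vertica version):
$ vsql -d drdata -c "SELECT designed_fault_tolerance, current_fault_tolerance FROM system;"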
... how are you running queries? I thought your database was down?
I would recommend you reach out to your vendor for assistance.
I'm running queries directly against the Vertica database.
No, it's not down. The database crashed and was restored on May 24th.
> restore it to the Last Good Epoch (which can be some days back)
This part is concerning. Do you mean that your LGE is days back, and that the recovery is forcing the database to go back a few days? If that's true, then data is not making it from WOS to ROS. You probably have Stale Checkpoint events in the active_events table, and you need to get to the bottom of why the LGE is stuck so far back. Is moveout hitting "too many ROS containers"? Is there a very long-running mergeout blocking a moveout? Something else?
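As a starting point (a rough sketch; the exact column sets of active_events and system can vary by version, so adjust as needed), these can be run from the shell with vsql:
$ vsql -d drdata -c "SELECT reporting_node, event_code_description, event_problem_description FROM active_events WHERE event_code_description ILIKE '%Stale Checkpoint%';"
$ vsql -d drdata -c "SELECT current_epoch, last_good_epoch, ahm_epoch FROM system;"
If last_good_epoch lags current_epoch by days, the Tuple Mover (moveout/mergeout) is the next thing to look at.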
--Sharon