file got deleted on one node

saumya_tamma · January 2014

Hi,

We are testing multiple backup and restore scenarios.

In one of the scenarios, we are testing whether recovery is automatic when a data file is deleted on one node of a 3-node cluster.

Here is what we did:

1) created a table and noted the timestamp on the datafiles that got created on one node (say node2)

dbadmin=> create table testing_df_delete (n number);

CREATE TABLE

2) removed the files on node 2 which were recently created (4 .fdb and .idx files)

3) checked that the table existed on node 1

dbadmin=> select * from testing_df_delete;

n

---

1

4) Checked that the table can no longer be queried on node 2

dbadmin=> select * from testing_df_delete;

ERROR 3413: FileColumnReader: unable to open position index /data/VERLABQA01/v_verlabqa01_node0002_data/201/49539595901177201/49539595901177201_0.pidx: No such file or directory

5) Brought down node 2 by running ps -ef | grep vertica and killing the Vertica process

6) Tried restarting node 2, but it does not come up; vertica.log shows the error below

[dbadmin@genalblabdb07n2 v_verlabqa01_node0002_catalog]$ tail -f vertica.log
2014-01-13 18:38:27.831 nameless:0x5ee14a0 [Catalog] <WARNING> Error getting size of file [/data/VERLABQA01/v_verlabqa01_node0002_data/201/49539595901125201/49539595901125201_0.fdb]: No such file or directory
2014-01-13 18:38:27.831 nameless:0x5ee14a0 [Catalog] <WARNING> Error getting size of file [/data/VERLABQA01/v_verlabqa01_node0002_data/205/49539595901125205/49539595901125205_0.fdb]: No such file or directory
2014-01-13 18:38:27.833 nameless:0x5ee1960 [Catalog] <WARNING> Error getting size of file [/data/VERLABQA01/v_verlabqa01_node0002_data/201/49539595901177201/49539595901177201_0.fdb]: No such file or directory
2014-01-13 18:38:27.833 nameless:0x5ee1960 [Catalog] <WARNING> Error getting size of file [/data/VERLABQA01/v_verlabqa01_node0002_data/205/49539595901177205/49539595901177205_0.fdb]: No such file or directory
2014-01-13 18:38:27.833 Main:0x5b456d0 [Recover] <INFO> Loading UDx libraries
2014-01-13 18:38:27.833 Main:0x5b456d0 [Recover] <INFO> Setting up UDx pointers
2014-01-13 18:38:27.834 Main:0x5b456d0 <PANIC> @v_verlabqa01_node0002: VX001/2973: Data consistency problems found; startup aborted
HINT: Check that all file systems are properly mounted. Also, the --force option can be used to delete corrupted data and recover from the cluster
LOCATION: mainEntryPoint, /scratch_a/release/vbuild/vertica/Basics/vertica.cpp:1166
2014-01-13 18:38:27.907 Main:0x5b456d0 [Main] <PANIC> Wrote backtrace to ErrorReport.txt

My question is: as this is a 3-node cluster, wouldn't the files be recreated automatically on restart as part of the recovery process?

Is the only solution to restore the latest backup on node 2 and then see if it recovers?

In such cases how do we bring up the cluster again to its working state?

Thanks
Saumya

Abhishek_Rana · January 2014

Hi,

The catalog tries to check the actual size of each data file. Because the files are not present, it logs the warnings about data consistency.
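Before forcing anything, it can help to see exactly which files the catalog could not find. The following is only a rough sketch in Python, assuming the warning text looks like the vertica.log lines posted above; the function name `missing_files` is made up for illustration and is not part of any Vertica tool.

```python
import re

# Pattern for the catalog warning seen in the vertica.log excerpt above.
# Assumption: the message wording is the same in your Vertica version.
MISSING_FILE = re.compile(
    r"<WARNING> Error getting size of file \[(?P<path>[^\]]+)\]: "
    r"No such file or directory"
)


def missing_files(log_lines):
    """Return the unique missing file paths, in order of first appearance."""
    seen = []
    for line in log_lines:
        match = MISSING_FILE.search(line)
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


# Two lines copied from the log in the original post.
sample = [
    "2014-01-13 18:38:27.831 nameless:0x5ee14a0 [Catalog] <WARNING> "
    "Error getting size of file [/data/VERLABQA01/v_verlabqa01_node0002_data"
    "/201/49539595901125201/49539595901125201_0.fdb]: No such file or directory",
    "2014-01-13 18:38:27.833 Main:0x5b456d0 [Recover] <INFO> Loading UDx libraries",
]

for path in missing_files(sample):
    print(path)
```

To run it against a real log, pass an open file instead of `sample`, for example `missing_files(open("vertica.log"))`.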

If the database is K-safe, you can recover this node by force-restarting node 2:

$ admintools -t restart_node -F -s <this_Hostname_or_IP> -d <dbname>

In this case, the files are copied from the other UP nodes and the node is recovered.
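If you want to script the forced restart, here is a minimal sketch in Python that only assembles the admintools command given above and, by default, prints it instead of running it. The function names, the `dry_run` switch, and the host and database values are hypothetical placeholders; only the admintools flags come from this thread.

```python
import subprocess


def restart_node_command(host, dbname):
    """Build the forced restart command quoted in this thread."""
    return ["admintools", "-t", "restart_node", "-F", "-s", host, "-d", dbname]


def force_restart(host, dbname, dry_run=True):
    """Print the command when dry_run is True; otherwise execute it."""
    cmd = restart_node_command(host, dbname)
    if dry_run:
        print(" ".join(cmd))
        return None
    # check=True raises CalledProcessError if admintools exits non-zero.
    return subprocess.run(cmd, check=True)


# Hypothetical host and database names, for illustration only.
force_restart("node2.example.com", "mydb")
```

Keeping the dry run as the default is deliberate: the force option deletes the corrupted data on that node, so it is worth reading the exact command before running it for real with `dry_run=False`.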

Regards,

Abhishek

saumya_tamma · January 2014

Thanks Abhishek, this helps.

I have one more question: what if we drop a schema from the database and want to restore it, but have only a full database backup and no object-level backup?

How can we recover it using that full database backup?

As of now, I see an error stating that some files do not exist when I try this by modifying the configuration file to include only the schema objects.

Is this even possible? Is it always necessary to have both an object-level backup and a full database backup?

[dbadmin@genalblabdb07n1 config]$ /opt/vertica/bin/vbr.py --task restore --config-file /opt/vertica/config/Full_Jan_2.ini
Preparing...
Found Database port: 5433
Copying...
25333: vbr client subproc on 15.224.232.116 terminates with returncode 1. Details in vbr_v_verlabqa01_node0002_client.log on that host.
Error msg: rsync: link_stat "/data/backups/genalblabdb07n1/v_verlabqa01_node0002/new_Jan_2_VERLABQA01/new_Jan_2_VERLABQA01.rst" (in vbr) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1508) [generator=3.0.7]
rsync failed!

25332: vbr client subproc on 15.224.232.115 terminates with returncode 1. Details in vbr_v_verlabqa01_node0001_client.log on that host.
Error msg:
rsync: link_stat "/data/backups/genalblabdb07n1/v_verlabqa01_node0001/new_Jan_2_VERLABQA01/new_Jan_2_VERLABQA01.rst" (in vbr) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1508) [generator=3.0.7]
rsync failed!

25335: vbr client subproc on 15.224.232.117 terminates with returncode 1. Details in vbr_v_verlabqa01_node0003_client.log on that host.
Error msg:
rsync: link_stat "/data/backups/genalblabdb07n1/v_verlabqa01_node0003/new_Jan_2_VERLABQA01/new_Jan_2_VERLABQA01.rst" (in vbr) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1508) [generator=3.0.7]
rsync failed!

Child processes terminated abnormally.
restore failed!

Thanks
Saumya

Arun_Prasad · December 2016

Hi Saumya,

I'm facing the same issue. Do you still remember how you solved this problem?

Thanks

Arun

unnikpr123 · May 2018

Hi,

We are facing the same issue. How did you resolve this? We have contacted technical support but have no resolution yet.

Regards,
Unni.

Jim_Knicely · May 2018

@unnikpr123 - Is your node down? If so, did you try starting it with admintools and the force option?

admintools -t restart_node -F -s <downed node I> -d <dbname>

The node should recover data from buddy projections.
