Data lost after stopping database

Pedro_Salgueiro · January 2014

Hi,

I have recently started to use Vertica 7 Community Edition on Amazon AWS using 3 nodes and I am having some issues. I used the Vertica 7 images to create the EC2 machines, performed the install in all 3 nodes, created a database and started it with no problems.

Then, I created some tables, loaded some data and made a few queries with no problem.
After a few minutes, I stopped the database through the AdminTools, which seemed to have stopped it with no problems. Rebooted all 3 nodes, and then started the database through the AdminTools, which started with out complains.

To my surprise, when I connected to the database, all previously created tables were gone.I tried to replicate the problem, and I managed: Started the database, created tables, loaded data to the tables, checked if data was ok, stoped the database and then started it. After this, all tables were gone.

I checked in the logs that when the database is starting, it deletes some files, which seems to be the files containing the data of those tables which are being lost.

Then I made a new test: Created the tales and loaded the data, but before I stop the database, I made a backup of if, using vbr.py. After restarting it, the problem remained. So I stopped it, restored the backup, started it again, and all data was there. Did a few restarts to see if the data was dropped, and it wasn't.

Is this a normal situation? It seems a bit odd, since there are no warnings about potential data loss while issuing the database shutdown.

As far as I can understand, it seems that I am stopping the database while all data is not persisted on disk. Is this what's happening?

If so, is there any way to force all data to be persisted, so it wouldn't be lost at database restart?

Why does that after a restore, the data stops to being dropped at database restart?

Hope someone has an idea on this one!

[Deleted User] · January 2014

Hi Pedro,

Hm... Do the logs indicate anything about missing files or data corruption?

Also, how did you reboot the nodes?

Vertica will always present you with a consistent state. If it's missing data from a transaction, it will roll back the entirety of that transaction rather than give you partial data.

Vertica uses ordinary files to store its data. If Linux is not shut down cleanly, recently-modified file data can be corrupted. So if you shut down Vertica, then force-stop your VMs through the Amazon console (which is the functional equivalent of pressing and holding the power button on your PC) rather than using the "shutdown" command-line command, then the filesystem can see corruption of recently-modified files.

Running the backup generated a bunch of additional/newer files and probably encouraged Linux to sync Vertica's filesystem metadata to disk, so by the time you power-cycled your VM, everything was properly stored.

Adam

[Deleted User] · January 2014

When you create the database, the schema (design) only contains the dbadmin user definition. Any additional objects you create using DDL (e.g. tables, projections, users, roles) is automatically committed and persistent. Any data you add using DML may or may not be persistent depending on the commit options used:
- insert is not committed by default, you have to issue a "commit;"
- copy is committed by default, but it has a "no commit" option to alter that behaviour

It's important to clarify if you are losing only the data, or the data and the schema (DDL). From your description "all tables were gone" it sounds like on database cycling you lose both the DDL and DML. If "all tables are gone" actually means "all data in the tables is gone" then I think possibly data was not committed prior to shutdown. This also fits the backup/restore piece nicely as when you run backup it does a snapshot which does commits and moveouts to ensure the max amount of data is included in the backup. And once restored that data is persistent.

So the first thing to check is when you initially load the data are you using INSERT and if so are you following it with a commit. Next would be to confirm if only the DML is lost (select * from a table would give 0 rows) or if DML and DDL are lost (select * from a table would give an error re no relation tablename exists).

Hopefully the above will help resolve or at least provide a clearer picture of what's expected.

Pedro_Salgueiro · January 2014

About the data loss, I am loosing everything, DML and DDL, as the tables are completely gone, they don't show up when you issue the "\d" command.

At first, I was stopping the VM using the EC2 console, but then I started to stop the VMs the right way, with the halt command.

Then I stopped rebooting(stopping/starting) the VMs, restarting only the databases, which was causing the same problem. So, I think that we can assume that it's not a sync problem, since the file system was not umounted by the reboot of the VM.

Since by default the Vertica EC2 images are prepared to store data in /vertica/data, which is raid volume, I had to make sure that there was no problem with the raid array, so, I created a new database from scratch but stored it in /home/dbadmin (root fs), and the problem remained.

I created a small table, restarted the database, not the VM, and the table disapeared. I created the table in vsql with the following command:

CREATE TABLE Persons
(
PersonID int,
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
City varchar(255)
);

I looked at the logs, and saw no errors. The only thing that I caught my attention is the following, which happens while restarting the database.

2014-01-09 16:34:12.142 Main:0x5d28900 [Init] <INFO> Startup [Check Storage] Removing unnecessary storage files
 2014-01-09 16:34:12.143 Main:0x5d28900 <LOG> @v_teste1_node0001: 00000/5206: Unknown file in Libraries directory: [public_ApproximateLib_45035996273719484.so]
 2014-01-09 16:34:12.143 Main:0x5d28900 [SAL] <INFO> Queueing directory [/home/dbadmin/teste1/v_teste1_node0001_catalog/Libraries/public_ApproximateLib_45035996273719484.so.del] for removal
 2014-01-09 16:34:12.143 Main:0x5d28900 <LOG> @v_teste1_node0001: 00000/5206: Unknown file in Libraries directory: [public_FlexTableLib_45035996273719278.so]
 2014-01-09 16:34:12.143 Main:0x5d28900 [SAL] <INFO> Queueing directory [/home/dbadmin/teste1/v_teste1_node0001_catalog/Libraries/public_FlexTableLib_45035996273719278.so.del] for removal

Here is complete vertica.log found in /home/dbadmin/teste1/v_teste1_node0001_catalog/ for creating the database named "teste1", including the creation of the table "Persons" and restarting the database:

http://pastebin.com/6icQgEMW

If you guys find something wrong in it, let me know.

[Deleted User] · January 2014

Please locate the following two files
/home/dbadmin/teste1/v_teste1_node0001_catalog/Catalog/Txnlogs/txn_2_76.cat
/home/dbadmin/teste1/v_teste1_node0001_catalog/Catalog/Txnlogs/txn_2_2.cat

from the last test that you got the vertica.log from and post them. It looks like on startup there's some bad character that's causing it to try to fix the transactionlog so it can salvage the load of objects, which is probably why your ddl and dml is lost.

2014-01-09 16:34:11.600 unknown:0x7f61b6357700 [Init] <INFO> Startup [Reading Catalog] Applying transaction log (bytes) - 0 / 62252
2014-01-09 16:34:11.600 unknown:0x7f61b6357700 [Catalog] <WARNING> DESERIALIZER: Bad line \8B
2014-01-09 16:34:11.600 unknown:0x7f61b6357700 [Catalog] <INFO> Fixing... corrupted_txn_2_76.cat --> txn_2_2.cat, size 3010
2014-01-09 16:34:11.600 unknown:0x7f61b6357700 [Catalog] <WARNING> Partial or corrupted transaction read from /home/dbadmin/teste1/v_teste1_node0001_catalog/Catalog/Txnlogs/txn_2_76.cat

We may need you to collect some other info but this will help try to determine what's tripping up the load of the txn log.

Pedro_Salgueiro · January 2014

The "txn_2_76.cat" wasn't there any more, but there is a "corrupted_txn_2_76.cat". I also found a "corrupted_txn_3_9.cat".

corrupted_txn_2_76.cat: http://www.filesnack.com/files/ch3yr38p
corrupted_txn_3_9.cat: http://www.filesnack.com/files/c7na6nmh
txn_2_2.cat: http://www.filesnack.com/files/chi2vifn

Some of these files really have some nasty things on it, could it be related to the some corrupt vertica installation?

Another thing, when I first created the VMs, I used the c3.xlarge instances but then I changed them to c3.2xlarge. I guess this wouldn't cause any problem, right?

[Deleted User] · January 2014

Definitely seems to be the root cause. The corrupt files both has a large chunk of non xml unreadable detail but do include the info about the Persons table. The "fixed" 2_2.cat is clean but missing everything from the corrupt block onwards. I'm going to have someone review them to see if there's any known reason for the corrupted blocks.

The only AMI instance we test and support is the supplied community cc2.8xlarge so unsure of the ramifications of the change you made. Both of the ones you chose are in the same family just lower performance. Where you aren't heavily invested in the database yet it's a good time to do an uninstall/reinstall of vertica just to make sure it's not related.

Pedro_Salgueiro · January 2014

Managed to solve the problem!

Completely uninstalled Vertica on each node, including the rpm's installed on the system. Installed the rpm on each node, and then used the /opt/vertica/sbin/install_vertica to finish the install procedure.

After that, I managed to create databases, add tables and data to them, restart the databases and the VMs, and no data was lost. I also change the instance type from c3.large to c3.xlarge and had no problems.

I also managed to recover my old backup, add new tables and data to it, restarted the database and no data was lost.

I guess that something got broken in Vertica installation/configuration for some unknown reason. Any way, it's working and your help was precious to solve the problem

Thank you all!

[Deleted User] · January 2014

Glad you were able to get past iit with the deinstall/reinstall. I can only guess that, assuming you used the same rpm distrib this time around, there was some sort of uncompress/copy failure during that initial install. You added test of switching instance types again seems to prove it had nothing to do with the behaviour you saw.

We're Moving!

Create My New Community Account Now

Data lost after stopping database

Comments

Leave a Comment