
Increase efficiency of backup

Hi, I have an idea to make the Vertica backup process (based on vbr.py) more efficient. Currently, at the operating-system file level, the same data is stored multiple times, often across multiple nodes, to improve Vertica query performance. This is good from a database access perspective, but it needs to be managed properly for the backup strategy to be efficient. The current vbr.py backup process rsyncs the files to the backup server as if they were all different files. There are utilities that can detect when two files have identical content and, instead of keeping them as duplicates, eliminate one and replace it with a hard link to the other, so that both names share the same inode. For example, check out the findup utility: http://fossies.org/linux/privat/fslint-2.42.tar.gz:a/fslint-2.42/fslint/findup If Vertica can come up with a similar strategy, it can make backups much more efficient. Zacharia Mathew
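
For reference, here is a minimal sketch of the kind of post-processing findup performs, assuming a plain directory tree on the backup host. The function names and the backup path are hypothetical and not part of vbr.py; this only illustrates the hash-then-hard-link technique described above.

    # Hypothetical sketch (not part of vbr.py): walk a backup directory, hash each
    # file's contents, and replace byte-identical copies with hard links to a
    # single canonical copy -- the same effect findup produces.
    import hashlib
    import os

    def file_digest(path, chunk_size=1024 * 1024):
        """Return the SHA-1 hex digest of a file's contents."""
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    def dedupe_backup_dir(root):
        """Replace byte-identical regular files under root with hard links."""
        seen = {}  # (size, digest) -> path of the canonical copy
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue
                key = (os.path.getsize(path), file_digest(path))
                original = seen.get(key)
                if original is None:
                    seen[key] = path
                elif not os.path.samefile(original, path):
                    os.unlink(path)           # drop the duplicate...
                    os.link(original, path)   # ...and point its name at the original inode

    if __name__ == '__main__':
        dedupe_backup_dir('/backups/vertica/snapshot1')  # hypothetical backup location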

Comments

  • Hi Zacharia, Thanks for the idea. This is definitely something that we've thought about, and that others have inquired about. The challenge is that, while we store the same *logical* data, that doesn't mean that the data *files* have the same content. For example, say you have two superprojections on a big table. Backing up both means you have two complete copies of the data. But say the data is sorted differently. Then the data files won't be the same (fslint won't identify them; nor will fdupes, my personal utility of choice for this), and symlinking one onto the other would break the backup.

    We could store only one copy of the projection and re-create the second projection from the first on restore. This would potentially be much more space-efficient. But it would also dramatically slow the restore process -- resegmenting a large projection can be a huge strain on system resources, depending on the projection details. So, would you like to be space-efficient or speed-efficient? Pick one :-)

    It is a good point, though -- right now we don't give you the option, and in cases where there are actually duplicate files (which are fewer than you might think), we don't deduplicate them.
  • Thank you. Could you also look at the idea I posted with the title "Improvements for Backup and Restore mechanism"? It is marked as implemented, but I do not think it is. Zacharia.
  • How about computing (once) a secure hash (like SHA-1: 20 bytes, or 40 hex chars) for each file and storing it in the catalog or as a suffix (or prefix) in the filename? Computing this checksum can be done at speeds of over 300 MB/sec per thread. Then you would only need to back up files whose checksum has changed. IgorM
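
A minimal sketch of the checksum idea in the last comment, in Python since vbr.py is Python. It stores digests in a flat manifest file rather than in the catalog or the filename, and for simplicity it recomputes source hashes on every run, whereas the suggestion is to compute them once and cache them. All paths, filenames, and function names here are hypothetical, not actual vbr.py behavior.

    # Hypothetical sketch: keep a manifest of path -> SHA-1 and copy only files
    # whose digest differs from the previous backup.
    import hashlib
    import json
    import os
    import shutil

    MANIFEST = 'backup_manifest.json'  # hypothetical manifest name

    def sha1_of(path, chunk_size=1024 * 1024):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    def incremental_backup(source_root, backup_root):
        manifest_path = os.path.join(backup_root, MANIFEST)
        old = {}
        if os.path.exists(manifest_path):
            with open(manifest_path) as f:
                old = json.load(f)
        new = {}
        for dirpath, _dirnames, filenames in os.walk(source_root):
            for name in filenames:
                src = os.path.join(dirpath, name)
                rel = os.path.relpath(src, source_root)
                digest = sha1_of(src)
                new[rel] = digest
                if old.get(rel) != digest:  # copy only new or changed files
                    dst = os.path.join(backup_root, rel)
                    if not os.path.isdir(os.path.dirname(dst)):
                        os.makedirs(os.path.dirname(dst))
                    shutil.copy2(src, dst)
        with open(manifest_path, 'w') as f:
            json.dump(new, f, indent=2)

    if __name__ == '__main__':
        incremental_backup('/vertica/data', '/backups/vertica/snapshot1')  # hypothetical paths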
