Vertica backup taking long time

Hello,

 

We have a 3-node cluster with a database size of 10 TB.

Backup is taking more than 40 hours to finish.

 

Is it the expected time, or are there ways to fine-tune the backup process?

 

Thanks!

Comments

  • Hello,

     

    Can somebody help me here?

     

    Thanks!

  • Hi abhi112,

     

    I don't know whether that's the expected time. The time required depends on your hardware, on how you make the backup (hardlink, local, backup host) and on the configuration in vbr.ini. I recently had to deal with customer support and found a 'backup takes too long' entry in the 'recent items' box. In that case the CHECKSUM parameter was set to true, which obviously adds a lot of load to rsync.
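    If you want to check that in your own setup, the setting lives in the [Transmission] section of vbr.ini. A minimal sketch (verify the exact parameter name and casing against the documentation for your Vertica version):

    [Transmission]
    # rsync-level checksumming; leaving it at false avoids the extra CPU/IO load
    checksum = False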

     

     

  • Hi Rok,

     

    Thanks for the response.

    Any idea what changes we can make here to improve the situation, or at least try to see whether we can reduce the time?

     

    Thanks! 

  • check the CHECKSUM parameter :-)

     

    First let me tell you that I am relatively new to Vertica, so my experience is very limited, but I have collected quite a bit of "generic" experience. Don't get me wrong, but nobody can give you any advice as long as you don't tell us something about your setup.

  • Hi Rok,

     

    Thanks for the response.

    I appreciate the effort you are taking here to help me out, despite being fairly new to this yourself. :-)

     

    Below is the .ini file we have for the backup:

     

    [Misc]
    snapshotName = FULLDB
    restorePointLimit = 1
    passwordFile = /home/dbadmin/dbadmin_task/config/backup/dbpass.pwd
    tempDir = /home/dbadmin/dbadmin_task/log/backup

    [Database]
    dbName = EDWPRDDB
    dbUser = dbadmin

    [Transmission]

    [Mapping]
    v_edwprddb_node0001 = EDWPRDVER01:/backup/EDWPRDDB
    v_edwprddb_node0002 = EDWPRDVER02:/backup/EDWPRDDB
    v_edwprddb_node0003 = EDWPRDVER03:/backup/EDWPRDDB

     

    Let me know what other information I should furnish so that you can take a look and see if there is anything.

     

    Thanks!

  • Hi abhi112,

     

    I suppose that

     

    v_edwprddb_node0001 = EDWPRDVER01:/backup/EDWPRDDB
    v_edwprddb_node0002 = EDWPRDVER02:/backup/EDWPRDDB
    v_edwprddb_node0003 = EDWPRDVER03:/backup/EDWPRDDB

     

    means that you are making local backups, so problems with the backup host or network throughput can't be the reason, except if /backup is mounted via NFS or iSCSI. Is /backup mounted on slow disks (portable ones)?

    I would check each link of the I/O chain for speed bottlenecks.
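    Something like this gives a rough idea of sequential write throughput on the backup location versus a local disk (paths taken from your mapping; the file names are just temporary test files):

    # rough sequential write test against the backup location (~4 GB)
    dd if=/dev/zero of=/backup/EDWPRDDB/ddtest.tmp bs=1M count=4096 conv=fdatasync
    rm /backup/EDWPRDDB/ddtest.tmp

    # same test against a local disk for comparison
    dd if=/dev/zero of=/tmp/ddtest.tmp bs=1M count=4096 conv=fdatasync
    rm /tmp/ddtest.tmp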

     

    But, btw, Vertica performs level 1 backups after the first (level 0) backup, so only the changes are written to the disks. Does that take 10 hours too?

     

     

     

  • Hi Rok,

     

    Thanks for the response.

    Actually, /backup is mounted via NFS, which is on an HP StoreEasy 1840 server running Windows 2012.

    Also, we have the restore point limit set to 1.

    So every time the backup runs it is a full backup: it initially deletes the old backup if any, starts the dry run to identify the files to back up, and then the backup process (transfer of files) starts.

     

    Thanks!

  • What's the Vertica version?

     

  • Hi Tharanga,

     

    It is v7.1.2-1

  • Hi abhi112,

     

    you wrote: "Also, we have the restore point limit set to 1. So every time the backup runs it is a full backup: it initially deletes the old backup if any, starts the dry run to identify the files to back up, and then the backup process (transfer of files) starts."

     

    but as far as I understand, vbr.py always creates a level 1 backup, except for the first run; at least that's what I read in the online documentation:

     

    Saving Incremental Backups

    Each time you back up your database with the same configuration file, vbr.py creates an incremental backup, copying new storage containers, which can include data that existed the last time you backed up the database, along with new and changed data since then. By default, vbr.py saves one current backup and one historical backup. You can configure the restore parameter to increase the number of stored backups.

     

    Setting restorePointLimit to 1 means "keep one backup", and that's exactly how my server behaves.

    The first run of vbr.py took quite a while and those following needed a fraction of the initial time.

     

    As I already told you, I recently had to deal with support, and after logging in I found "Questions on full vs. incremental backups and issues with Windows backuphost" under "recent items". It seems that exactly your problem is described there: due to an issue with the Windows file system, timestamps don't match and full backups are taken instead of incremental ones. I think you should log in and have a look at it (but unfortunately you might not like the conclusion drawn there: move to a non-Windows backup system).

     

    Greets

      rok

  • Hi abhi112,

     

    I wrote quite a lot and everything was discarded, so I'll make a shorter attempt now. Setting restorePointLimit to 1 means "keep one full set of data, create one restore point". Only the first run of vbr.py should create a full backup; the following ones should be incremental. The (incorrect) behaviour of your cluster might be due to an issue in the Windows file system, where timestamps don't exactly match those on the Linux box(es). I stumbled over that item on the support page; you might want to log in and read it, it's under "recent items".

    (The way backups are taken is also explained there.) Maybe you will not like the conclusion: "Ultimately customer moved their backuphost and file system to a different non Windows environment and backups worked as expected." Wish I could have brought better news.

     

    Greets

        rok

     

    PS: Did you synchronize your backup host via NTP?
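    On the Vertica nodes, something like the following shows whether the clocks agree (which command is available depends on the distribution); compare the result with the clock on the backup host:

    ntpq -p            # classic ntpd: peers and offsets
    chronyc tracking   # chrony, if that's what the nodes run
    date               # quick manual comparison against the backup host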

     

  • sorry - first post was obviously not discarded, I was too stupid to see there's more than one page ...

  • Vertica will always keep restore point limit + 1 backups. Therefore none of the subsequent backups are full backups, they are all incremental. 

     

    Did you say your NFS mounts are on a Windows server? We have seen Windows round up or truncate timestamps; hence vbr thinks it's a different file and goes through a full backup, copying all the files over and over.
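    A quick way to see whether the mount preserves sub-second timestamps (the test file name is just a placeholder; run this on one of the Vertica nodes):

    touch /backup/EDWPRDDB/tstest.tmp
    stat -c '%y' /backup/EDWPRDDB/tstest.tmp   # fractional seconds all zero suggests precision is being lost
    rm /backup/EDWPRDDB/tstest.tmp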

     

    There are many factors contributing to the speed. Major factors are your network and disk IO. We've seen slow NFS mounts due to disk or network problems. You can usually troubleshoot these by looking through logs (vbr and other system logs) or using tools to measure IO performance.

     

    Database design can also contribute to slow backups. This happens when you have millions of small files instead of hundreds of thousands of big files. You can run the following query and see whether you have projections that suffer from this problem.

     

    (Disclaimer: vs_ros is an internal table and it can change without any notice. storage_containers is not; it's meant to be used by customers.)

    select median(size) over() as median_fsize from vs_ros as ros, storage_containers as cont where ros.delid=cont.storage_oid and cont.node_name='node' and
    cont.projection_name='proj_name' limit 1;
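    For example, you could run it through vsql with your own node and projection substituted in (the projection name below is just a placeholder; the database and user come from the vbr.ini shown above):

    vsql -U dbadmin -d EDWPRDDB -c "
      select median(size) over() as median_fsize
      from vs_ros as ros, storage_containers as cont
      where ros.delid = cont.storage_oid
        and cont.node_name = 'v_edwprddb_node0001'
        and cont.projection_name = 'my_projection_b0'
      limit 1;"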

     

    We re-designed the Vbr tool in 7.2 which makes backups much faster, especially incremental backups. We also introduced a new bundled file format (again in 7.2) which reduces the number of files by 2X or more (we've seen 10X reduction in the field).

     

     

     

  • Thanks a lot, Rok and Tharanga, for the response.

    I will consider all your points and check what could be wrong with our setup.

     

    Thanks!

  • Hi abhi112

     

    Did you already find a solution?

     

    Greets

        rok

  • Hi Rok,

     

    Unfortunately, no. :(

  • Did you find the problem?

  • Hi Tharanga,

     

    Sorry for the late response. 

    I checked that the backup server is in time sync with the Vertica cluster.

    Now I can see 2 copies of the backup (restore point limit is set to 1).

    But the incremental backup also took close to 20 hours.

    I have also checked the database design with the query you provided, and it seems to be fine.

     

    FYA, I have pasted below the client.log for last weekend's backup:

     

    Number of files: 4005984
    Number of files transferred: 1175105
    Total file size: 1679456442418 bytes
    Total transferred file size: 570404685682 bytes
    Literal data: 570404685682 bytes
    Matched data: 0 bytes
    File list size: 94837556
    File list generation time: 0.002 seconds
    File list transfer time: 0.000 seconds
    Total bytes sent: 570624367052
    Total bytes received: 27670681

    sent 570624367052 bytes received 27670681 bytes 8815677.65 bytes/sec
    total size is 1679456442418 speedup is 2.94
    2016-10-16 05:23:25 Transfer client process exit

     

     

    Please let me know if you can give more insight into it.

     

    Thanks,

    Abhi

  • What's the median file size reported by that query ?

     

    If you just calculate the average file size from the numbers reported above (1,679,456,442,418 bytes / 4,005,984 files ≈ 420 KB), it falls into the KB range (the average is not a good indicator, though); the files seem to be very small.

     

    For 3 nodes, the last incremental backup transferred more than a million files. Your database does seem to suffer from the "small files" problem; among other issues, this makes backup performance worse. Creating millions of hard links for each backup slows things down further.

     

    Are you planning to upgrade to 7.2 or later anytime soon? As I stated before, we made 2 major improvements in 7.2.

    1. Overhauled Vbr and got rid of the hard link mechanism.

    2. Introduced a compact on-disk file format that reduces the number of files, usually by 10X.

     

    Wide tables (hundreds of columns), very granular partitioning (e.g., partition by hour), local segmentation (disabled by default, unless you enable it manually), and frequent trickle loads where mergeout cannot keep up are some of the causes that can create millions of small files.
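    A rough way to see where the containers pile up is to count storage containers per projection (storage_containers is a public system table; column names can vary slightly between versions, so treat this as a sketch):

    vsql -U dbadmin -d EDWPRDDB -c "
      select node_name, projection_name, count(*) as containers, sum(used_bytes) as bytes
      from storage_containers
      group by node_name, projection_name
      order by containers desc
      limit 10;"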

     

    There are some design improvements you can make (improve partitioning, etc.), but you cannot get major improvements like a 10x or greater speed-up without upgrading.

  • Thanks Tharanga for the details.

    We will suggest the upgrade to the client.

     

    Thanks,

    Abhi

  • Hi Tharanga,

     

    I know it's OT, but do you know more about the problem with Windows? Comparing timestamps seems to work here (?) but not on other systems. Is it a matter of the operating system or the file system?

     

    Greets

        rok

  • Not clear whether it worked here or not. I saw this problem on a Windows-server-hosted NFS (cannot recall which Windows Server version it was) back in 2014. It had dropped the millisecond part, hence the backup thought the file was modified and copied it over and over.
