Options

backup fails with SIGINT Error

Hi,

I am taking a backup of a 500G database using vbr.py script onto a remote host.

The backup is sometimes failing with the following message

8526: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
Connection to 15.224.232.116 closed by remote host.

8525: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.


In the log the error that is seen when it fails is just that it is cancelled by SIGINT. I do not understand whats the reason for this. I didnt terminate the session. In fact I ran it in the background so that it shouldnt face this issue.

Later if I try to backup again sometimes it is successful and sometimes it fails. Please help me understand what is happening here.

data/VERLABQA01/v_verlabqa01_node0002_data/481/49539595901145481/
data/VERLABQA01/v_verlabqa01_node0002_data/481/49539595901145481/49539595901145481_0.fdb
       32768   0%  864.86kB/s    0:08:45
    40435712   8%   38.56MB/s    0:00:10
    82444288  18%   39.33MB/s    0:00:09
   126681088  27%   40.28MB/s    0:00:07
   159186944  35%   37.97MB/s    0:00:07
   2014-01-06 15:26:10 cancelled by SIGINT

Thanks
Saumya

Comments

  • Options
    Hi Saumya,

    Can you try increasing the number of concurrent connections ssh allows on the backup host.
      You can modify the parameter MaxStartups in /etc/ssh/sshd_config.    Thanks,  Shobhit  
  • Options
    Shobhit,

    I tried increasing the maxstartups parameter and re ran it.. It was successful..However still at times it takes three times to finish the backup with the same message but the backup is being successful eventually.

    Is there anything else that might be causing it?

    Thanks
    Saumya
  • Options
    Hi Saumya,

    Is there any improvement after setting high value for that parameter?
    Check if any old rsync is running, is yes kill it and run the vbr again.
    Check the ssh time limits

    Can you paste the whole error from vbr logs
    Send me the output of grep MaxStartups /etc/ssh/sshd_config

    Thanks,
    Shobhit
  • Options
    No there is no improvement. I am trying to backup 690GB database and it is always failing now on its full backup with the same issue.

    [root@msalb002qvdb01 ~]# grep MaxStartups /etc/ssh/sshd_config
    MaxStartups 30


    [dbadmin@genalblabdb07n1 vbr_history_2014-01-09_14-15-06]$ cat vbr27307.log
    2014-01-09 14:00:35 Helper cancelling process entry
    2014-01-09 14:00:35 ps aux | grep /tmp/vbr/vbr.py | grep -v grep | grep client
    2014-01-09 14:00:35 dbadmin  31949  0.2  0.0 165120 12048 ?        Ss   13:45   0:01 /opt/vertica/oss/python/bin/python /tmp/vbr/vbr.py --task client --config-file /tmp/vbr/Full_Jan_.ini --context backup 15.224.232.62 /data/backups/genalblabdb07n1/v_verlabqa01_node0001/new_Jan__VERLABQA01 15.224.232.115 /catalog/VERLABQA01/v_verlabqa01_node0001_catalog/Snapshots/new_Jan__VERLABQA01 v_verlabqa01_node0001 VERLABQA01 False
    2014-01-09 14:00:35 Helper cancelling process exit



    [dbadmin@genalblabdb07n1 vbr_history_2014-01-09_14-15-06]$ tail -100 vbr_v_verlabqa01_node0001_client.log
    data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273787755/
    data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273787755/45035996273787755_0.fdb
           32768   1%   64.52kB/s    0:00:37
         2446078 100%    3.96MB/s    0:00:00 (xfer#1707, to-check=1002/6281)
    data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273787755/45035996273787755_0.pidx
           11424 100%   18.72kB/s    0:00:00
           11424 100%   18.69kB/s    0:00:00 (xfer#1708, to-check=1001/6281)
    data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273806755/
    data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273806755/45035996273806755_0.fdb
              31 100%    0.05kB/s    0:00:00
              31 100%    0.05kB/s    0:00:00 (xfer#1709, to-check=1002/6283)
    data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273806755/45035996273806755_0.pidx
              56 100%    0.09kB/s    0:00:00
              56 100%    0.09kB/s    0:00:00 (xfer#1710, to-check=1001/6283)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273721757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273721757/45035996273721757_0.fdb
           16922 100%   27.45kB/s    0:00:00
           16922 100%   27.45kB/s    0:00:00 (xfer#1711, to-check=1001/6291)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273721757/45035996273721757_0.pidx
             168 100%    0.27kB/s    0:00:00
             168 100%    0.27kB/s    0:00:00 (xfer#1712, to-check=1000/6291)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273761757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273761757/45035996273761757_0.fdb
           32768   0%   52.89kB/s    0:05:14
        16678591 100%   17.48MB/s    0:00:00 (xfer#1713, to-check=1001/6293)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273761757/45035996273761757_0.pidx
           32768  28%   35.09kB/s    0:00:02
          114016 100%  121.82kB/s    0:00:00 (xfer#1714, to-check=1000/6293)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273763757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273763757/45035996273763757_0.fdb
           32768   0%   34.93kB/s    0:11:05
         4489216  19%    4.28MB/s    0:00:04
        23292489 100%   15.79MB/s    0:00:01 (xfer#1715, to-check=1003/6297)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273763757/45035996273763757_0.pidx
           32768  82%   78.24kB/s    0:00:00
           39536 100%   94.40kB/s    0:00:00 (xfer#1716, to-check=1002/6297)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273781757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273781757/45035996273781757_0.fdb
             327 100%    0.78kB/s    0:00:00
             327 100%    0.78kB/s    0:00:00 (xfer#1717, to-check=1001/6297)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273781757/45035996273781757_0.pidx
              56 100%    0.13kB/s    0:00:00
              56 100%    0.13kB/s    0:00:00 (xfer#1718, to-check=1000/6297)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273783757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273783757/45035996273783757_0.fdb
           32768   0%   77.29kB/s    0:03:35
        16682528 100%   21.73MB/s    0:00:00 (xfer#1719, to-check=1001/6299)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273783757/45035996273783757_0.pidx
           32768  28%   43.54kB/s    0:00:01
          114016 100%  151.28kB/s    0:00:00 (xfer#1720, to-check=1000/6299)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273785757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273785757/45035996273785757_0.fdb
           32768   1%   43.24kB/s    0:00:55
         2445778 100%    2.90MB/s    0:00:00 (xfer#1721, to-check=1001/6301)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273785757/45035996273785757_0.pidx
           11424 100%   13.86kB/s    0:00:00
           11424 100%   13.86kB/s    0:00:00 (xfer#1722, to-check=1000/6301)
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273788757/
    data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273788757/45035996273788757_0.fdb
           32768   0%   39.56kB/s   17:28:55
         9404416   0%    8.97MB/s    0:04:30
        62423040   2%   29.78MB/s    0:01:19
       109871104   4%   34.93MB/s    0:01:06
       154075136   6%   36.74MB/s    0:01:02
       196575232   7%   44.64MB/s    0:00:50
       237371392   9%   41.72MB/s    0:00:52
       276856832  11%   39.81MB/s    0:00:54
       312475648  12%   37.77MB/s    0:00:56
       341868544  13%   33.99MB/s    0:01:01
       385384448  15%   34.63MB/s    0:00:59
       423591936  17%   34.37MB/s    0:00:58
       464027648  18%   35.47MB/s    0:00:55
       500334592  20%   37.78MB/s    0:00:51
       543064064  21%   37.60MB/s    0:00:50
       585924608  23%   38.70MB/s    0:00:48
       624787456  25%   38.35MB/s    0:00:47
       671907840  26%   40.90MB/s    0:00:43
       715948032  28%   41.21MB/s    0:00:42
       765198336  30%   42.73MB/s    0:00:39
       799703040  32%   41.69MB/s    0:00:39
       841482240  33%   40.45MB/s    0:00:39
       879755264  35%   39.08MB/s    0:00:40
       922189824  37%   37.46MB/s    0:00:40
       965246976  38%   39.45MB/s    0:00:37
      1005912064  40%   39.17MB/s    0:00:36
      1044938752  41%   39.34MB/s    0:00:35
      1082130432  43%   38.09MB/s    0:00:36
      1124892672  45%   38.02MB/s    0:00:35
      1167687680  46%   38.52MB/s    0:00:33
      1210613760  48%   39.42MB/s    0:00:31
      1252032512  50%   40.42MB/s    0:00:29
      1294598144  52%   40.43MB/s    0:00:28
      1338703872  53%   40.74MB/s    0:00:27
      1380024320  55%   40.41MB/s    0:00:26
      1417019392  56%   39.37MB/s    0:00:26
      1458896896  58%   39.20MB/s    0:00:25
      1498578944  60%   38.15MB/s    0:00:25
      1544847360  62%   39.16MB/s    0:00:23
      1578139648  632014-01-09 14:00:35 cancelled by SIGINT






  • Options
    Out of curiosity, how are you running vbr.py?  For example, are you ssh'ing into the server in question?

    What is vbr.py's command-line output, if any?
  • Options
    I am using a shell script that I have created which uses vbr.py, creates a configuration file and then runs the vbr.py task
     
    I am not ssh'ing into the remote server, I run it from the server which needs the backup to be taken and the remote server name is just in the configuration file.

    /opt/vertica/bin/vbr.py --task backup --config-file /opt/vertica/config/Full_Jan_2.ini

    [dbadmin@genalblabdb07n1 vertica_scripts]$ cat /opt/vertica/config/Full_Jan_2.ini
    [Misc]
    snapshotName = new_Jan_2_VERLABQA01
    verticaConfig = True
    restorePointLimit = 7
    tempDir = /tmp/vbr
    retryCount = 2
    retryDelay = 1

    [Database]
    dbName = VERLABQA01
    dbUser = dbadmin
    dbPassword = *****

    [Transmission]
    encrypt = False
    checksum = False
    port_rsync = 50000
    bwlimit = 0


    [Mapping1]
    dbNode = v_verlabqa01_node0001
    backupHost = msalb002qvdb01
    backupDir = /data/backups/genalblabdb07n1



    [Mapping2]
    dbNode = v_verlabqa01_node0002
    backupHost = msalb002qvdb01
    backupDir = /data/backups/genalblabdb07n1



    [Mapping3]
    dbNode = v_verlabqa01_node0003
    backupHost = msalb002qvdb01
    backupDir = /data/backups/genalblabdb07n1




  • Options

    ### Below is the message from the vbr.py when run in the background


    Database Size: 718190798401                               : 16:50:41
    Starting database backup                                  : 16:50:41
    Preparing...
    Found Database port:  5433
    Copying...
    [..................................................] 0%[..................................................] 1%[=.................................................] 2%[=.................................................] 3%[==................................................] 4%[==................................................] 5%[===...............................................] 6%[===...............................................] 7%[====..............................................] 8%[====..............................................] 9%[=====.............................................] 10%[=====.............................................] 11%[======............................................] 12%[======............................................] 13%16663: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.
    Error msg:
    Connection to 15.224.232.115 closed by remote host.

    Child processes terminated abnormally.
    backup failed!
    cleaning up...
    16666: vbr client subproc on 15.224.232.117 terminates with returncode 255. Details in vbr_v_verlabqa01_node0003_client.log on that host.
    Error msg:
    Connection to 15.224.232.117 closed by remote host.

    16664: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
    Error msg:
    Connection to 15.224.232.116 closed by remote host.
    Retrying... #1
    Copying...
    [======............................................] 13%[..................................................] 1%[=.................................................] 2%[=.................................................] 3%[==................................................] 4%[==................................................] 5%[===...............................................] 6%[===...............................................] 7%[====..............................................] 8%[====..............................................] 9%[=====.............................................] 10%[=====.............................................] 11%[======............................................] 12%[======............................................] 13%[=======...........................................] 14%9660: vbr client subproc on 15.224.232.117 terminates with returncode 255. Details in vbr_v_verlabqa01_node0003_client.log on that host.
    Error msg:
    Connection to 15.224.232.117 closed by remote host.

    9657: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.
    Error msg:
    Connection to 15.224.232.115 closed by remote host.

    9658: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
    Error msg:
    Connection to 15.224.232.116 closed by remote host.

    Child processes terminated abnormally.
    backup failed!
    cleaning up...
    Retrying... #2
    Copying...
    [=======...........................................] 14%[..................................................] 1%[=.................................................] 2%[=.................................................] 3%[==................................................] 4%[==................................................] 5%[===...............................................] 6%[===...............................................] 7%[====..............................................] 8%[====..............................................] 9%[=====.............................................] 10%[=====.............................................] 11%[======............................................] 12%[======............................................] 13%[=======...........................................] 14%[=======...........................................] 15%[========..........................................] 16%[========..........................................] 17%[=========.........................................] 18%2176: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
    Error msg:
    Connection to 15.224.232.116 closed by remote host.

    2177: vbr client subproc on 15.224.232.117 terminates with returncode 255. Details in vbr_v_verlabqa01_node0003_client.log on that host.
    Error msg:
    Connection to 15.224.232.117 closed by remote host.

    2175: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.
    Error msg:
    Connection to 15.224.232.115 closed by remote host.

    Child processes terminated abnormally.
    backup failed!
    cleaning up...
    Backup Failed.                                            : 17:36:20

  • Options
    This sounds to me like vbr.py is connecting to the Vertica node in question to perform the backup, and the connection is being closed by some external mechanism.

    When the connection is closed, Linux closes all sub-processes associated with the connection on the Vertica side because their parent has been torn down.  It does so with a SIGINT, so that's where that comes from.

    The backup machine sees that the connection that it established has been closed, and indicates as much in its error message.

    So, question is, what's closing your connections?  Do you have a firewall, router, load balancer, or any other sort of appliance between your backup server (where you're running vbr.py) and  your Vertica cluster?  We generally recommend having a direct connection if possible.  If you do have a firewall of some sort, it is probably configured to close connections that run (and/or that appear idle) for too long.  In which case you should update your firewall configuration -- the backup process can be lengthy, so has many long-running connections.  vbr.py internally uses ssh for some of its commands, so you may be able to work around the issue by setting the "ServerAliveInterval" option in your SSH client config file on your machines.
  • Options

    Thanks Adam for the information.

    We disabled the firewall between the vertica cluster and the backup server.

    I am running the script on one node of the vertica cluster and in the configuration file I mentioned the backup host to be the remote backup host. Should I run the script on the remote backup host?

    I will test by setting the ServerAliveInterval and will see if it works and will update.
  • Options
    Adam,

    Setting of ServerAliveInterval works like a charm :)

    Thank you so much

    Regards
    Saumya
  • Options
    Testing  that dbadmin can do passwordless ssh versus all cluster nodes solved!
  • Options

    I had the same scenario as Saumya  and this is what I did...

    #MaxStartups 10:30:60                /etc/ssh/sshd_config

    #ServerAliveInterval  9000           /etc/ssh/ssh_config

    #ClientAliveCountMax  3000      /etc/ssh/sshd_config

    #ClientAliveInterval  9000         /etc/ssh/sshd_config


    change these parameters in both the  3 node cluster and backup host as well...

    It works....




  • Options
    Hi Saumya,

    how to set the ServerAliveInterval ? me too facing the same error.

    but one difference is backup server also the same machine.

    could you please in this regard.


    Regards,
    Eswar
  • Options
    Eswar,

    In /etc/ssh/ssh_config file on linux box add the line ServerAliveInterval  9000 

    Thanks
    Saumya

  • Options
    Hi Saumya,

    I have changed the parameter and tried .but same error

    18040: vbr client subproc on xx.xxx.xxx.xxx terminates with returncode 1

    could you please help me.i am trying to take the schema level back up.
    Regards,
    Eswar
  • Options
    Hi Adam,

    18040: vbr client subproc on xx.xxx.xxx.xxx terminates with returncode 1

    could you please help me.i am trying to take the schema level back up.
    Regards,
    Eswar

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file