Backup fails with SIGINT error
Hi,
I am taking a backup of a 500 GB database using the vbr.py script onto a remote host.
The backup sometimes fails with the following message:
8526: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
Connection to 15.224.232.116 closed by remote host.
8525: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.
In the log, the only error shown when it fails is that it was cancelled by SIGINT. I don't understand the reason for this. I didn't terminate the session; in fact, I ran it in the background precisely so it wouldn't face this issue.
If I later try the backup again, it sometimes succeeds and sometimes fails. Please help me understand what is happening here.
data/VERLABQA01/v_verlabqa01_node0002_data/481/49539595901145481/
data/VERLABQA01/v_verlabqa01_node0002_data/481/49539595901145481/49539595901145481_0.fdb
32768 0% 864.86kB/s 0:08:45
40435712 8% 38.56MB/s 0:00:10
82444288 18% 39.33MB/s 0:00:09
126681088 27% 40.28MB/s 0:00:07
159186944 35% 37.97MB/s 0:00:07
2014-01-06 15:26:10 cancelled by SIGINT
Thanks
Saumya
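(For context: launching vbr.py with a trailing & keeps it attached to the terminal, so a dropped session can still signal it. A fully detached invocation would look like the sketch below; the paths are assumptions taken from the config file quoted later in this thread. As it turns out later in the thread, though, the SIGINT here came from the remote side closing the connection, not from the terminal.)
# Detach vbr.py from the terminal entirely (paths assumed from later in this thread)
nohup /opt/vertica/bin/vbr.py --task backup \
    --config-file /opt/vertica/config/Full_Jan_2.ini \
    > /tmp/vbr_backup.out 2>&1 &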
Comments
Can you try increasing the number of concurrent connections SSH allows on the backup host?
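(A minimal sketch of that change, assuming a stock OpenSSH setup; the values are illustrative:)
# /etc/ssh/sshd_config on the backup host
# MaxStartups start:rate:full -- with 10:30:60, sshd randomly drops 30% of new
# unauthenticated connections once 10 are pending, and refuses all of them at 60.
MaxStartups 10:30:60
# sshd rereads its config only on restart/reload (RHEL-style init assumed):
service sshd reload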
I tried increasing the MaxStartups parameter and re-ran it. It was successful. However, at times it still takes three attempts to finish the backup, failing with the same message each time, though the backup eventually succeeds.
Is there anything else that might be causing it?
Thanks
Saumya
Is there any improvement after setting a higher value for that parameter?
Check whether any old rsync is still running; if yes, kill it and run vbr again (see the sketch after this post).
Check the SSH time limits.
Can you paste the whole error from the vbr logs?
Send me the output of: grep MaxStartups /etc/ssh/sshd_config
Thanks,
Shobhit
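(A quick sketch of the stale-rsync check suggested above; the port number is taken from the config file quoted later in this thread, everything else is illustrative:)
# Look for rsync processes left over from a previous vbr run
ps aux | grep [r]sync
# Kill any stale ones (by PID, or all at once):
pkill -f rsync
# vbr's rsync uses the port from the config (port_rsync = 50000);
# confirm nothing is still bound to it:
netstat -tlnp | grep :50000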
[root@msalb002qvdb01 ~]# grep MaxStartups /etc/ssh/sshd_config
MaxStartups 30
[dbadmin@genalblabdb07n1 vbr_history_2014-01-09_14-15-06]$ cat vbr27307.log
2014-01-09 14:00:35 Helper cancelling process entry
2014-01-09 14:00:35 ps aux | grep /tmp/vbr/vbr.py | grep -v grep | grep client
2014-01-09 14:00:35 dbadmin 31949 0.2 0.0 165120 12048 ? Ss 13:45 0:01 /opt/vertica/oss/python/bin/python /tmp/vbr/vbr.py --task client --config-file /tmp/vbr/Full_Jan_.ini --context backup 15.224.232.62 /data/backups/genalblabdb07n1/v_verlabqa01_node0001/new_Jan__VERLABQA01 15.224.232.115 /catalog/VERLABQA01/v_verlabqa01_node0001_catalog/Snapshots/new_Jan__VERLABQA01 v_verlabqa01_node0001 VERLABQA01 False
2014-01-09 14:00:35 Helper cancelling process exit
[dbadmin@genalblabdb07n1 vbr_history_2014-01-09_14-15-06]$ tail -100 vbr_v_verlabqa01_node0001_client.log
data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273787755/
data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273787755/45035996273787755_0.fdb
32768 1% 64.52kB/s 0:00:37
2446078 100% 3.96MB/s 0:00:00 (xfer#1707, to-check=1002/6281)
data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273787755/45035996273787755_0.pidx
11424 100% 18.72kB/s 0:00:00
11424 100% 18.69kB/s 0:00:00 (xfer#1708, to-check=1001/6281)
data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273806755/
data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273806755/45035996273806755_0.fdb
31 100% 0.05kB/s 0:00:00
31 100% 0.05kB/s 0:00:00 (xfer#1709, to-check=1002/6283)
data/VERLABQA01/v_verlabqa01_node0001_data/755/45035996273806755/45035996273806755_0.pidx
56 100% 0.09kB/s 0:00:00
56 100% 0.09kB/s 0:00:00 (xfer#1710, to-check=1001/6283)
data/VERLABQA01/v_verlabqa01_node0001_data/757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273721757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273721757/45035996273721757_0.fdb
16922 100% 27.45kB/s 0:00:00
16922 100% 27.45kB/s 0:00:00 (xfer#1711, to-check=1001/6291)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273721757/45035996273721757_0.pidx
168 100% 0.27kB/s 0:00:00
168 100% 0.27kB/s 0:00:00 (xfer#1712, to-check=1000/6291)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273761757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273761757/45035996273761757_0.fdb
32768 0% 52.89kB/s 0:05:14
16678591 100% 17.48MB/s 0:00:00 (xfer#1713, to-check=1001/6293)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273761757/45035996273761757_0.pidx
32768 28% 35.09kB/s 0:00:02
114016 100% 121.82kB/s 0:00:00 (xfer#1714, to-check=1000/6293)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273763757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273763757/45035996273763757_0.fdb
32768 0% 34.93kB/s 0:11:05
4489216 19% 4.28MB/s 0:00:04
23292489 100% 15.79MB/s 0:00:01 (xfer#1715, to-check=1003/6297)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273763757/45035996273763757_0.pidx
32768 82% 78.24kB/s 0:00:00
39536 100% 94.40kB/s 0:00:00 (xfer#1716, to-check=1002/6297)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273781757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273781757/45035996273781757_0.fdb
327 100% 0.78kB/s 0:00:00
327 100% 0.78kB/s 0:00:00 (xfer#1717, to-check=1001/6297)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273781757/45035996273781757_0.pidx
56 100% 0.13kB/s 0:00:00
56 100% 0.13kB/s 0:00:00 (xfer#1718, to-check=1000/6297)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273783757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273783757/45035996273783757_0.fdb
32768 0% 77.29kB/s 0:03:35
16682528 100% 21.73MB/s 0:00:00 (xfer#1719, to-check=1001/6299)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273783757/45035996273783757_0.pidx
32768 28% 43.54kB/s 0:00:01
114016 100% 151.28kB/s 0:00:00 (xfer#1720, to-check=1000/6299)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273785757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273785757/45035996273785757_0.fdb
32768 1% 43.24kB/s 0:00:55
2445778 100% 2.90MB/s 0:00:00 (xfer#1721, to-check=1001/6301)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273785757/45035996273785757_0.pidx
11424 100% 13.86kB/s 0:00:00
11424 100% 13.86kB/s 0:00:00 (xfer#1722, to-check=1000/6301)
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273788757/
data/VERLABQA01/v_verlabqa01_node0001_data/757/45035996273788757/45035996273788757_0.fdb
32768 0% 39.56kB/s 17:28:55
9404416 0% 8.97MB/s 0:04:30
62423040 2% 29.78MB/s 0:01:19
109871104 4% 34.93MB/s 0:01:06
154075136 6% 36.74MB/s 0:01:02
196575232 7% 44.64MB/s 0:00:50
237371392 9% 41.72MB/s 0:00:52
276856832 11% 39.81MB/s 0:00:54
312475648 12% 37.77MB/s 0:00:56
341868544 13% 33.99MB/s 0:01:01
385384448 15% 34.63MB/s 0:00:59
423591936 17% 34.37MB/s 0:00:58
464027648 18% 35.47MB/s 0:00:55
500334592 20% 37.78MB/s 0:00:51
543064064 21% 37.60MB/s 0:00:50
585924608 23% 38.70MB/s 0:00:48
624787456 25% 38.35MB/s 0:00:47
671907840 26% 40.90MB/s 0:00:43
715948032 28% 41.21MB/s 0:00:42
765198336 30% 42.73MB/s 0:00:39
799703040 32% 41.69MB/s 0:00:39
841482240 33% 40.45MB/s 0:00:39
879755264 35% 39.08MB/s 0:00:40
922189824 37% 37.46MB/s 0:00:40
965246976 38% 39.45MB/s 0:00:37
1005912064 40% 39.17MB/s 0:00:36
1044938752 41% 39.34MB/s 0:00:35
1082130432 43% 38.09MB/s 0:00:36
1124892672 45% 38.02MB/s 0:00:35
1167687680 46% 38.52MB/s 0:00:33
1210613760 48% 39.42MB/s 0:00:31
1252032512 50% 40.42MB/s 0:00:29
1294598144 52% 40.43MB/s 0:00:28
1338703872 53% 40.74MB/s 0:00:27
1380024320 55% 40.41MB/s 0:00:26
1417019392 56% 39.37MB/s 0:00:26
1458896896 58% 39.20MB/s 0:00:25
1498578944 60% 38.15MB/s 0:00:25
1544847360 62% 39.16MB/s 0:00:23
1578139648 63%
2014-01-09 14:00:35 cancelled by SIGINT
What is vbr.py's command-line output, if any?
I am not SSHing into the remote server; I run the script from the server that needs to be backed up, and the remote server name appears only in the configuration file.
/opt/vertica/bin/vbr.py --task backup --config-file /opt/vertica/config/Full_Jan_2.ini
[dbadmin@genalblabdb07n1 vertica_scripts]$ cat /opt/vertica/config/Full_Jan_2.ini
[Misc]
snapshotName = new_Jan_2_VERLABQA01
verticaConfig = True
restorePointLimit = 7
tempDir = /tmp/vbr
retryCount = 2
retryDelay = 1
[Database]
dbName = VERLABQA01
dbUser = dbadmin
dbPassword = *****
[Transmission]
encrypt = False
checksum = False
port_rsync = 50000
bwlimit = 0
[Mapping1]
dbNode = v_verlabqa01_node0001
backupHost = msalb002qvdb01
backupDir = /data/backups/genalblabdb07n1
[Mapping2]
dbNode = v_verlabqa01_node0002
backupHost = msalb002qvdb01
backupDir = /data/backups/genalblabdb07n1
[Mapping3]
dbNode = v_verlabqa01_node0003
backupHost = msalb002qvdb01
backupDir = /data/backups/genalblabdb07n1
### Below is the output from vbr.py when run in the background
Database Size: 718190798401 : 16:50:41
Starting database backup : 16:50:41
Preparing...
Found Database port: 5433
Copying...
[..................................................] 0% ... [======............................................] 13%
16663: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.
Error msg:
Connection to 15.224.232.115 closed by remote host.
Child processes terminated abnormally.
backup failed!
cleaning up...
16666: vbr client subproc on 15.224.232.117 terminates with returncode 255. Details in vbr_v_verlabqa01_node0003_client.log on that host.
Error msg:
Connection to 15.224.232.117 closed by remote host.
16664: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
Error msg:
Connection to 15.224.232.116 closed by remote host.
Retrying... #1
Copying...
[..................................................] 1% ... [=======...........................................] 14%
9660: vbr client subproc on 15.224.232.117 terminates with returncode 255. Details in vbr_v_verlabqa01_node0003_client.log on that host.
Error msg:
Connection to 15.224.232.117 closed by remote host.
9657: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.
Error msg:
Connection to 15.224.232.115 closed by remote host.
9658: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
Error msg:
Connection to 15.224.232.116 closed by remote host.
Child processes terminated abnormally.
backup failed!
cleaning up...
Retrying... #2
Copying...
[..................................................] 1% ... [=========.........................................] 18%
2176: vbr client subproc on 15.224.232.116 terminates with returncode 255. Details in vbr_v_verlabqa01_node0002_client.log on that host.
Error msg:
Connection to 15.224.232.116 closed by remote host.
2177: vbr client subproc on 15.224.232.117 terminates with returncode 255. Details in vbr_v_verlabqa01_node0003_client.log on that host.
Error msg:
Connection to 15.224.232.117 closed by remote host.
2175: vbr client subproc on 15.224.232.115 terminates with returncode 255. Details in vbr_v_verlabqa01_node0001_client.log on that host.
Error msg:
Connection to 15.224.232.115 closed by remote host.
Child processes terminated abnormally.
backup failed!
cleaning up...
Backup Failed. : 17:36:20
When the connection is closed, Linux terminates all sub-processes associated with it on the Vertica side, because their parent has been torn down. It does so with a SIGINT, which is where that message comes from.
The backup machine sees that the connection that it established has been closed, and indicates as much in its error message.
So, the question is: what's closing your connections? Do you have a firewall, router, load balancer, or any other appliance between your backup server (where you're running vbr.py) and your Vertica cluster? We generally recommend a direct connection if possible. If you do have a firewall of some sort, it is probably configured to close connections that run (or appear idle) for too long. In that case you should update your firewall configuration: the backup process can be lengthy, so it holds many long-running connections. vbr.py internally uses ssh for some of its commands, so you may be able to work around the issue by setting the "ServerAliveInterval" option in the SSH client config file on your machines.
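(A minimal sketch of that client-side keepalive, assuming stock OpenSSH; the interval here is illustrative, and later posts in this thread used 9000:)
# /etc/ssh/ssh_config (or ~/.ssh/config for dbadmin) on every node that runs vbr.py
Host *
    # Send a keepalive probe through the encrypted channel every 60 seconds,
    # so an intermediate firewall never sees the connection as idle.
    ServerAliveInterval 60
    # Give up after this many unanswered probes.
    ServerAliveCountMax 3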
Thanks, Adam, for the information.
We disabled the firewall between the Vertica cluster and the backup server.
I am running the script on one node of the Vertica cluster, and in the configuration file I specified the remote backup host. Should I run the script on the remote backup host instead?
I will test by setting ServerAliveInterval, see if it works, and update here.
Setting ServerAliveInterval works like a charm.
Thank you so much!
Regards
Saumya
I had the same scenario as Saumya, and this is what I did:
MaxStartups 10:30:60 (in /etc/ssh/sshd_config)
ServerAliveInterval 9000 (in /etc/ssh/ssh_config)
ClientAliveCountMax 3000 (in /etc/ssh/sshd_config)
ClientAliveInterval 9000 (in /etc/ssh/sshd_config)
Change these parameters on all three cluster nodes and on the backup host as well.
It works.
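(For completeness: the two ClientAlive settings are the server-side counterparts of ServerAliveInterval and belong in sshd_config, so sshd must reload its config for them to take effect. A sketch, assuming a RHEL-style init like the hosts shown earlier:)
# /etc/ssh/sshd_config on the backup host and on each cluster node
ClientAliveInterval 9000
ClientAliveCountMax 3000
# Apply:
service sshd reload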
How do I set ServerAliveInterval? I am facing the same error too.
One difference, though: the backup server is the same machine.
Could you please help me in this regard?
Regards,
Eswar
In the /etc/ssh/ssh_config file on the Linux box, add the line: ServerAliveInterval 9000
Thanks
Saumya
I have changed the parameter and tried, but I get the same error:
18040: vbr client subproc on xx.xxx.xxx.xxx terminates with returncode 1
Could you please help me? I am trying to take a schema-level backup.
Regards,
Eswar