Vertica Restore Fails with rsync: failed to connect to 172.30.0.13: Connection refused (111)
Hello,
I am trying to restore a Vertica 7.1.x DB and running into a failure with rsync.
I can ssh as dbadmin between all the database nodes and the backup server, in both directions.
Thank you so much for any help!
[dbadmin@ip-172-30-0-74 ~]$ /opt/vertica/bin/vbr.py --task restore --config-file ~/vertica/aws_vertica7_1_x_fullbak1.ini --debug 3
[{'dbHost': '', 'dbDir': '', 'dbNode': 'v_db_node0001', 'backupHost': '172.30.0.13', 'backupDir': '/home/dbadmin/vertica/backup/'}, {'dbHost': '', 'dbDir': '', 'dbNode': 'v_db_node0002', 'backupHost': '172.30.0.13', 'backupDir': '/home/dbadmin/vertica/backup/'}, {'dbHost': '', 'dbDir': '', 'dbNode': 'v_db_node0003', 'backupHost': '172.30.0.13', 'backupDir': '/home/dbadmin/vertica/backup/'}]
{'port_rsync': 50000, 'upNodes': [], 'dbPassword': 'xxxx', 'restorePointLimit': 14, 'total_bwlimit_restore': 0, 'fullTopology': [], 'serviceAccessUser': None, 'dbOptions': '', 'configFileName': '/home/dbadmin/vertica/aws_vertica7_1_x_fullbak1.ini', 'total_bwlimit_backup': 0, 'retryDelay': 1, 'passwordFile': '/home/dbadmin/vertica/aws_vertica7_1_x_fullbak1.pwd', 'dryrun': False, 'optNodes': [], 'sshVertica': ['ssh', '-x', '-o', 'ServerAliveInterval=60'], 'verticaBinDir': '/opt/vertica/bin', 'nodeStates': {}, 'dbUser': 'dbadmin', 'forceLocalRsyncD': False, 'concurrency_restore': 1, 'outputJson': False, 'dbPromptForPassword': False, 'copyCtx': 'backup', 'debug': 3, 'dbInitiator': None, 'rsyncSubPIDList': [], 'scpVertica': ['scp'], 'archiveSpecified': '', 'bwlimit': 0, 'overwrite': True, 'encrypt': False, 'snapshotName': 'aws_vertica7_1_x_fullbak1', 'scpBackup': ['scp'], 'logFileName': None, 'tmpDir': '/tmp/vbr', 'concurrency_backup': 1, 'hardLinkLocal': False, 'serviceAccessPass': None, 'objects': None, 'topology': [{'dbHost': '', 'dbDir': '', 'dbNode': 'v_db_node0001', 'backupHost': '172.30.0.13', 'backupDir': '/home/dbadmin/vertica/backup/'}, {'dbHost': '', 'dbDir': '', 'dbNode': 'v_db_node0002', 'backupHost': '172.30.0.13', 'backupDir': '/home/dbadmin/vertica/backup/'}, {'dbHost': '', 'dbDir': '', 'dbNode': 'v_db_node0003', 'backupHost': '172.30.0.13', 'backupDir': '/home/dbadmin/vertica/backup/'}], 'dbPort': None, 'downNodes': [], 'sshBackup': ['ssh', '-x', '-o', 'ServerAliveInterval=60'], 'checksum': False, 'sqlCached': '', 'retryCount': 2, 'dbName': 'db'}
Preparing...
v_db_node0001 172.30.0.74 /home/dbadmin/db/v_db_node0001_catalog
v_db_node0002 172.30.0.215 /home/dbadmin/db/v_db_node0002_catalog
v_db_node0003 172.30.0.216 /home/dbadmin/db/v_db_node0003_catalog
Found Database port: 5433
DB | Host | State
----+--------------+-------
db | 172.30.0.215 | DOWN
db | 172.30.0.216 | DOWN
db | 172.30.0.74 | DOWN
g["upNodes"]: []
g["downNodes"]: ['v_db_node0002', 'v_db_node0003', 'v_db_node0001']
g["nodeStates"] {'v_db_node0003': 'DOWN', 'v_db_node0002': 'DOWN', 'v_db_node0001': 'DOWN'}
new g["topology"] after filtering: [{'backupDir': '/home/dbadmin/vertica/backup/', 'dbHost': '172.30.0.74', 'dbNode': 'v_db_node0001', 'backupHost': '172.30.0.13', 'dbDir': '/home/dbadmin/db/v_db_node0001_catalog'}, {'backupDir': '/home/dbadmin/vertica/backup/', 'dbHost': '172.30.0.215', 'dbNode': 'v_db_node0002', 'backupHost': '172.30.0.13', 'dbDir': '/home/dbadmin/db/v_db_node0002_catalog'}, {'backupDir': '/home/dbadmin/vertica/backup/', 'dbHost': '172.30.0.216', 'dbNode': 'v_db_node0003', 'backupHost': '172.30.0.13', 'dbDir': '/home/dbadmin/db/v_db_node0003_catalog'}]
Preparing vertica hosts: set(['172.30.0.74', '172.30.0.215', '172.30.0.216'])
Preparing backup hosts: set(['172.30.0.13'])
preparing on host 172.30.0.74...
preparing on host 172.30.0.215...
preparing on host 172.30.0.216...
preparing on host 172.30.0.13...
Recovering...
Copying...
7950 :: 172.30.0.13 /home/dbadmin/vertica/backup/v_db_node0001/aws_vertica7_1_x_fullbak1 172.30.0.74 /home/dbadmin/db/v_db_node0001_catalog
7951 :: 172.30.0.13 /home/dbadmin/vertica/backup/v_db_node0002/aws_vertica7_1_x_fullbak1 172.30.0.215 /home/dbadmin/db/v_db_node0002_catalog
7953 :: 172.30.0.13 /home/dbadmin/vertica/backup/v_db_node0003/aws_vertica7_1_x_fullbak1 172.30.0.216 /home/dbadmin/db/v_db_node0003_catalog
7951: vbr client subproc on 172.30.0.215 terminates with returncode 1. Details in vbr_v_db_node0002_client.log on that host.
Error msg: rsync: failed to connect to 172.30.0.13: Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(122) [Receiver=3.0.7]
rsync failed!
Child processes terminated abnormally.
restore failed!
cleaning up...
7953: vbr client subproc on 172.30.0.216 terminates with returncode 1. Details in vbr_v_db_node0003_client.log on that host.
Error msg: rsync: failed to connect to 172.30.0.13: Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(122) [Receiver=3.0.7]
rsync failed!
7950: vbr client subproc on 172.30.0.74 terminates with returncode 1. Details in vbr_v_db_node0001_client.log on that host.
Error msg: rsync: failed to connect to 172.30.0.13: Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(122) [Receiver=3.0.7]
rsync failed!
[dbadmin@ip-172-30-0-74 vbr]$ cat vbr_v_db_node0001_client.log
2016-08-11 19:57:05 Transfer client process entry: my pid is 9203; task is restore.
2016-08-11 19:57:05 rsyncOptions
2016-08-11 19:57:05 ['/opt/vertica/bin/rsync', '--stats', '--whole-file', '--progress']
2016-08-11 19:57:05 Removing all data under /home/dbadmin/db/v_db_node0001_catalog ...
2016-08-11 19:57:05 Dry-run to find transfer size
2016-08-11 19:57:05 rsync failed with code 10
2016-08-11 19:57:05 rsync failed!
2016-08-11 20:02:57 Transfer client process entry: my pid is 10599; task is restore.
2016-08-11 20:02:57 rsyncOptions
2016-08-11 20:02:57 ['/opt/vertica/bin/rsync', '--stats', '--whole-file', '--progress']
2016-08-11 20:02:57 Removing all data under /home/dbadmin/db/v_db_node0001_catalog ...
2016-08-11 20:02:57 Dry-run to find transfer size
2016-08-11 20:02:57 rsync failed with code 10
2016-08-11 20:02:57 rsync failed!
Comments
Before you kick off Copy Cluster, please verify that no rsync processes are running on any node of either the source or the target cluster. Kill any rsync processes you find.
Also, there are logs in /tmp/vbr that you can check to find out why it's failing.
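One way to do the rsync check suggested above is a small script run on each node and on the backup host. This is only a sketch: the function names are made up for illustration, and it assumes Python 3.7+ and a `ps` that accepts `-eo pid,args` (true of the usual Linux procps).

```python
import subprocess


def parse_rsync_processes(ps_output):
    """Return (pid, command) pairs for rows of `ps -eo pid,args` output that run rsync."""
    found = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        executable = command.split()[0]
        # Match on the executable's base name so "grep rsync" is not reported.
        if executable.rsplit("/", 1)[-1] == "rsync":
            found.append((pid, command))
    return found


def list_rsync_processes():
    """Run ps on the local host and return any rsync processes it reports."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return parse_rsync_processes(result.stdout)
```

Calling `list_rsync_processes()` on a host returns an empty list when no rsync is running there. Anything it does return is a candidate to inspect before killing it by hand.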
Thanks @kaurora ---
I've ensured there were no running rsync commands. I've also rebooted the servers and checked again.
I've ensured all firewalls are down and that passwordless ssh is working between all the nodes and the backup server, as well as from the backup server to all nodes.
The error message I'm getting from rsync is generic:
9936: vbr client subproc on 172.30.0.216 terminates with returncode 1. Details in vbr_v_db_node0003_client.log on that host.
Error msg: rsync: failed to connect to 172.30.0.13: Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(122) [Receiver=3.0.7]
rsync failed!
All nodes return the same error.
My backup config is as follows:
[centos@ip-172-30-0-74 vertica]$ cat aws_vertica7_1_x_fullbak1.ini
[Misc]
snapshotName = aws_vertica7_1_x_fullbak1
restorePointLimit = 14
passwordFile = /home/dbadmin/vertica/aws_vertica7_1_x_fullbak1.pwd
[Database]
dbName = db
dbUser = dbadmin
[Transmission]
[Mapping]
v_db_node0001 = 172.30.0.13:/home/dbadmin/vertica/backup/
v_db_node0002 = 172.30.0.13:/home/dbadmin/vertica/backup/
v_db_node0003 = 172.30.0.13:/home/dbadmin/vertica/backup/
Do I need to include the port_rsync parameter, or will it default to 50000 (the value shown in the debug output)? Are there any other tests I can run to try to find out why this restore will not work?
Thanks,
Drew
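As one extra test, a plain TCP connect from each database node to the backup host shows whether anything is listening on the rsync port. This is a sketch, not part of vbr.py. `can_connect` is a hypothetical helper name, and the host and port in the note below are the ones from this thread (172.30.0.13 and the `port_rsync` value of 50000 in the debug output).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("172.30.0.13", 50000)` run from a database node. A `False` result only says that nothing accepted the connection at that moment. It does not say whether the cause is a firewall or no daemon listening, so it is best combined with a look at the logs on the backup host.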
On the backup node I found:
[centos@vert-backup vbr_history_2016-08-12_14-54-53]$ cat vbr_rsyncd.log
2016/08/12 14:10:58 [3388] rsync: failed to create pid file /tmp/vbr/vbr_rsyncd.pid: Broken pipe (32)
2016/08/12 14:10:58 [3388] rsync error: error in file IO (code 11) at clientserver.c(987) [Receiver=3.0.7]
[centos@vert-backup vbr]$ cat vb*.log
2016/08/12 15:06:36 [1952] rsync: failed to create pid file /tmp/vbr/vbr_rsyncd.pid: File exists (17)
2016/08/12 15:06:36 [1952] rsync error: error in file IO (code 11) at clientserver.c(987) [Receiver=3.0.7]
2016/08/12 15:06:36 [1960] rsync: failed to create pid file /tmp/vbr/vbr_rsyncd.pid: File exists (17)
2016/08/12 15:06:36 [1960] rsync error: error in file IO (code 11) at clientserver.c(987) [Receiver=3.0.7]
Yes,
Sorry, I was going to post this earlier. After a failed attempt there may be leftover files or folders that need to be cleaned up, such as lock and pid files. Good that you figured it out.
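The pid file cleanup can be scripted so that it only removes a file whose process is gone. This is a rough sketch with hypothetical function names, assuming a POSIX host and that the file holds a single PID. The only path taken from this thread is /tmp/vbr/vbr_rsyncd.pid, from the rsyncd log above. Lock file names are not shown here, so they are left out.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_pid_file(path):
    """Delete path unless it names a live process; return True if it was deleted."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return False
    if text.isdigit() and int(text) > 0 and pid_is_running(int(text)):
        return False  # the daemon is still alive, so leave its pid file alone
    os.remove(path)  # empty, unreadable as a PID, or naming a dead process
    return True
```

On the backup host this would be called as `remove_stale_pid_file("/tmp/vbr/vbr_rsyncd.pid")` before retrying the restore. If it returns `False` while the file still exists, an rsync daemon is still running and should be dealt with first.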