rsync error: error in socket IO (code 10)

Hi,

I am trying to back up my database to remote servers using the vbr.py script. I set up a passwordless ssh connection for the dbadmin user between the production server and the backup server, and ssh works fine, but the backup fails with the following error:

rsync: failed to connect to 15.224.232.169: Connection timed out (110)
rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
rsync failed!

53961: vbr client subproc on 1.1.54.3 terminates with returncode 1. Details in vbr_v_verprd1_node0002_client.log on that host.
rsync: failed to connect to 15.224.232.168: Connection timed out (110)
rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
rsync failed!

Child processes terminated abnormally.
backup failed!


Some of the files are copied and then it suddenly terminates. Can you please help me understand why this is happening and what the fix is?

Thanks
Saumya

Comments

  • Abhishek_Rana (Vertica Employee)
    Hi,

    Please verify that you have passwordless ssh access to all nodes in the cluster, including from each node to itself.


    Missing passwordless ssh to the node itself is often the problem.
    Missing passwordless ssh on any one of the nodes could also cause this error.

    vbr.py uses rsync port 50000; check that this port is open.


    Rsync itself uses port 873 by default.

    To test for open ports:

    1. On each node, including node 1, start a listener on port <number>:

    nc -l 873 

    Each node goes into listen mode and waits for remote input on the port.

    2. In another session on node 1, act as the sender and send a message over port <number>:

    nc nodename 873 

    Here nodename is the hostname or IP address of the node being tested, starting with node 1. nc goes into send mode and waits for you to type.

    3. Type some text and press Return. The text should be pushed across the port and displayed on the listening machine. Press CTRL-C to exit.


    Repeat for each node to confirm that the port is open on every node as seen from the initiator node.


    Other Vertica Ports


    Vertica
    5433 TCP (All connections)
    Spread
    4803 TCP (Client connections)
    4803 UDP (Daemon <-> Daemon)
    4804 UDP (Daemon <-> Daemon)
    4805 UDP (Monitor to Daemon) (optional and only if "DangerousMonitor = yes" in config file)

    Regards,

    Abhishek
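
The `nc` check described above can also be scripted so that several hosts and ports are probed in one pass. The following is a minimal Python sketch and is not part of vbr.py: the function names `port_is_open` and `check_ports` are hypothetical, and the ports you pass in would typically be the TCP ports mentioned above (50000 for the vbr.py rsync daemon, 5433 for Vertica, 4803 for Spread clients). It tests TCP reachability only, so a listener must already be running on the target port (for example the `nc -l` from step 1); otherwise a closed port and a blocked port both show up as failures.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(hosts, ports, timeout=3.0):
    """Probe every (host, port) pair and return {(host, port): bool}."""
    return {
        (host, port): port_is_open(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

Calling, say, `check_ports(["backuphost01"], [50000, 5433, 4803])` from each production node (the host name here is a placeholder) gives a quick map of which ports are reachable from that node.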
  • Hi,

    Passwordless ssh works fine from all the production servers to the remote backup hosts. I do see that the backup directory on the remote host is created and some files are copied.

    However, passwordless ssh from the backup host to the production server is not working. Can that be an issue? Is it required, or is passwordless ssh from production to the backup server enough?

    Thanks
    Saumya

  • Hi,

    Passwordless ssh is now set up in both directions between the servers and the ports are open too, but the backup still fails with the same error.

    Please let me know what is wrong here.

    Thanks
    saumya
  • When I run the rsync command directly it works without any issue, but the backup always fails with the error. Please help.

    ############
    rsync manually
    ############

    [dbadmin@msast001pvdb01 config]$ rsync -avr --rsh=/usr/bin/ssh /opt/vertica/config/prod_backup_remote.ini VERRWCSTDB01:/opt/vertica/config/


    sending incremental file list
    prod_backup_remote.ini

    sent 666 bytes  received 31 bytes  464.67 bytes/sec
    total size is 557  speedup is 0.80

    ############
    Error during backup:
    ############

    rsync: failed to connect to 192.168.201.202: Connection timed out (110)
    rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
    rsync failed!

    Child processes terminated abnormally.




  • Can you try changing the port rsync is using and then try again?
  • Hi Karan,

    How do I do that?

    Thanks
    Saumya
  • Are you not using a config file? 
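
For reference, the rsync port is normally taken from the configuration file given to vbr.py (prod_backup_remote.ini in this thread). The sketch below shows one way to look the value up programmatically. It assumes the port is stored as `port_rsync` in a `[Transmission]` section; confirm the section and key names against your own .ini file before relying on this. The function name `rsync_port_from_config` and the sample text are hypothetical.

```python
import configparser


def rsync_port_from_config(text, default=50000):
    """Return the rsync port from vbr-style .ini text, or default if unset.

    Assumption: the port is stored as port_rsync under [Transmission].
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getint("Transmission", "port_rsync", fallback=default)


# Hypothetical config fragment, for illustration only.
SAMPLE = """
[Transmission]
port_rsync = 50001
"""

print(rsync_port_from_config(SAMPLE))  # prints 50001
```

Changing the port then amounts to editing that value in the configuration file and rerunning the backup, after making sure the new port is allowed through any firewall between the hosts.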
  • Hi Karan,

    I tried both ports, 873 and 50000. With 873 I get a "Connection refused" error, and with 50000 I get "Connection timed out". Please let me know how to fix this. We have not had backups for a long time and really need this running.

    ### When using 873


    rsync: failed to connect to 192.168.201.204: Connection refused (111)
    rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
    rsync failed!


    [root@VERRWCSTDB01 ~]# cat /tmp/vbr/vbr_rsyncd.log
    2014/04/01 11:44:19 [28758] rsyncd version 3.0.7 starting, listening on port 873
    2014/04/01 11:44:19 [28758] bind() failed: Permission denied (address-family 2)
    2014/04/01 11:44:19 [28758] socket(10,1,6) failed: Address family not supported by protocol
    2014/04/01 11:44:19 [28758] unable to bind any inbound sockets on port 873
    2014/04/01 11:44:19 [28758] rsync error: error in socket IO (code 10) at socket.c(541) [Receiver=3.0.7]
    2014/04/01 11:48:46 [29388] rsyncd version 3.0.7 starting, listening on port 873
    2014/04/01 11:48:46 [29388] bind() failed: Permission denied (address-family 2)
    2014/04/01 11:48:46 [29388] socket(10,1,6) failed: Address family not supported by protocol
    2014/04/01 11:48:46 [29388] unable to bind any inbound sockets on port 873
    2014/04/01 11:48:46 [29388] rsync error: error in socket IO (code 10) at socket.c(541) [Receiver=3.0.7]

    ### When using 50000

    rsync: failed to connect to 192.168.201.202: Connection timed out (110)
    rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
    rsync failed!



    Thanks
    Saumya
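
The two failures above point at different causes. "Connection refused (111)" means the host answered but nothing was listening on the port, which matches the rsyncd log: the daemon could not bind port 873, and ports below 1024 normally require root privileges on Linux. "Connection timed out (110)" means no answer came back at all, which usually indicates that packets are being dropped somewhere on the path, for example by a firewall. A minimal Python sketch that makes the same distinction follows; the function name `diagnose_connect` and the returned labels are hypothetical, and it is not part of vbr.py.

```python
import socket


def diagnose_connect(host, port, timeout=5.0):
    """Classify a TCP connection attempt to host:port.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but no process is listening on the port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No reply within the timeout; often a firewall dropping packets.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"
```

Run from a production node against the backup host, a "timeout" result on the vbr.py rsync port suggests checking firewall rules between the two networks, while "refused" suggests the rsync daemon is not listening on that port.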
  • Also, running the rsync command manually for the data directory works without any issue. It fails only when using vbr.py. Why is this the case?


    [dbadmin@msast001pvdb01 ~]$ ps -ef |grep rsync
    dbadmin  11250     1  0 Jan02 ?        00:00:00 /opt/vertica/bin/rsync --daemon --config=/tmp/vbr/vbr_rsyncd.conf --port=50000
    [dbadmin@msast001pvdb01 ~]$ /opt/vertica/bin/rsync -avz /data/backups dbadmin@VERRWCSTDB01:/data/backups
    sending incremental file list
    backups/
    backups/v_verprd1_node0001/
    backups/v_verprd1_node0001/.production_backup_test.done/
    backups/v_verprd1_node0001/production_backup_test/
    backups/v_verprd1_node0001/production_backup_test/production_backup_test.info
    backups/v_verprd1_node0001/production_backup_test/production_backup_test.txt
    backups/v_verprd1_node0001/production_backup_test/catalog/
    backups/v_verprd1_node0001/production_backup_test/catalog/VERPRD1/
    backups/v_verprd1_node0001/production_backup_test/catalog/VERPRD1/v_verprd1_node0001_catalog/
    backups/v_verprd1_node0001/production_backup_test/catalog/VERPRD1/v_verprd1_node0001_catalog/vertica.conf
    backups/v_verprd1_node0001/production_backup_test/catalog/VERPRD1/v_verprd1_node0001_catalog/Snapshots/
    backups/v_verprd1_node0001/production_backup_test/catalog/VERPRD1/v_verprd1_node0001_catalog/Snapshots/catalog.ctlg
    backups/v_verprd1_node0001/production_backup_test/data/
    backups/v_verprd1_node0001/production_backup_test/data/VERPRD1/
    backups/v_verprd1_node0001/production_backup_test/data/VERPRD1/v_verprd1_node0001_data/

  • Do you test the ports manually just before you run the backup utility? If so, check right after the manual test whether the port is still in use.
  • Karan,

    Yes, I checked netstat for port 50000 while the backup was running with vbr.py, and the port is open. I also see the message below in the log:


    [root@VERRWCSTDB01 vbr]# tail -f vbr_rsyncd.log
    2014/04/01 15:01:21 [20252] rsyncd version 3.0.7 starting, listening on port 50000

    While the backup is running, I also see that port 873 is being used by rsync on all the nodes, so that is open too.



    egrep rsync /etc/services
    rsync           873/tcp                         # rsync
    rsync           873/udp                         # rsync
    airsync         2175/tcp                # Microsoft Desktop AirSync Protocol
    airsync         2175/udp                # Microsoft Desktop AirSync Protocol


    Thanks
    Saumya

  • Can you try running on a different port, such as 50001 or higher, without testing it manually before you run the Vertica backup script?
  • Karan,

    I tried two ports, 50001 and 50002, and got the same error:

    rsync: failed to connect to 192.168.201.202: Connection timed out (110)
    rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
    rsync failed!

    Child processes terminated abnormally.
    backup failed!


    On production host:

    [dbadmin@msast001pvdb01 config]$ cat //tmp/vbr/vbr26212.log
    2014-04-02 09:33:27 Helper cancelling process entry
    2014-04-02 09:33:27 ps aux | grep /tmp/vbr/vbr.py | grep -v grep | grep client
    [dbadmin@msast001pvdb01 config]$ cat //tmp/vbr/vbr_v_verprd1_node0001_client.log
    2014-04-02 09:26:52 Transfer client process entry: my pid is 13506; task is backup.
    2014-04-02 09:26:52 Read lock acquired on .ctlg file
    2014-04-02 09:26:52 linking/copying special files at client: .ctlg, .txt, .conf
    2014-04-02 09:26:53 Dry-run to find transfer size
    2014-04-02 09:27:56 rsync failed with code 10
    2014-04-02 09:27:56 rsync failed!
    2014-04-02 09:32:23 Transfer client process entry: my pid is 23333; task is backup.
    2014-04-02 09:32:23 Read lock acquired on .ctlg file
    2014-04-02 09:32:23 linking/copying special files at client: .ctlg, .txt, .conf
    2014-04-02 09:32:24 Dry-run to find transfer size
    2014-04-02 09:33:27 rsync failed with code 10
    2014-04-02 09:33:27 rsync failed!
    [dbadmin@msast001pvdb01 config]$


    On remote host:

    [root@VERRWCSTDB01 vbr]# cat vbr_v_verprd1_node0001_server.log
    2014-04-02 09:27:00 Transfer Server process entry: my pid is 17560; task is backup.
    2014-04-02 09:27:00 Acquiring remoteServer mutex
    2014-04-02 09:27:00 Acquired remoteServer mutex
    2014-04-02 09:27:00 ps aux | grep rsync | grep -v grep | grep daemon | grep port=50001
    2014-04-02 09:27:00
    2014-04-02 09:27:00 Rsync daemon is now running
    2014-04-02 09:27:00 Released mutex
    2014-04-02 09:27:00 Transfer Server process exit
    2014-04-02 09:32:31 Transfer Server process entry: my pid is 18329; task is backup.
    2014-04-02 09:32:31 Acquiring remoteServer mutex
    2014-04-02 09:32:31 Acquired remoteServer mutex
    2014-04-02 09:32:31 ps aux | grep rsync | grep -v grep | grep daemon | grep port=50001
    2014-04-02 09:32:31 dbadmin  17578  0.0  0.0 107680   656 ?        Ss   09:27   0:00 /opt/vertica/bin/rsync --daemon --config=/tmp/vbr/vbr_rsyncd.conf --port=50001
    2014-04-02 09:32:31 Rsync daemon is already running
    2014-04-02 09:32:31 Released mutex
    2014-04-02 09:32:31 Transfer Server process exit
    [root@VERRWCSTDB01 vbr]# cat vbr_rsyncd.log
    2014/04/02 09:27:00 [17578] rsyncd version 3.0.7 starting, listening on port 50001

    Thanks
    saumya



  • Can you verify the following? 

    - Your rsync meets the minimum version required by Vertica.
    - You can do passwordless ssh back and forth as both root and dbadmin.
    - No rsync daemons are running when you try to perform the backup.

    I will post more ideas if I can think of anything else.

  • Karan,

    The answer is yes for all. I verified all those again and they look fine.

    Thanks
    saumya
  • Is that true for all your nodes or just the node you are running on? I had a similar issue in the past, but that was because one of the nodes did not have passwordless ssh.
  • Karan,

    It's true for all the nodes that I am running it on.

    Thanks
    Saumya
  • Hi,

    Is there any update on this issue? Is there anything else that I am missing?

    Thanks
    Saumya
  • This issue is fixed. Because the backup was running across different datacenters, something in the firewall was blocking it.

    It is running fine now.

    Thanks
    Saumya
  • Good to know. 
  • Hi Saumya,

    We are facing the same issue. Can you please let me know what changes you made in the firewall to make it work?

    Thanks!!
  • I am having the same issue. Do you remember the firewall changes?

     

  • I know this is an old post, but I had the same issue and wanted to post what I did to remediate it. I had two separate clusters in two separate datacenters and was attempting to use the vbr script to replicate some objects from one cluster to the other. The source cluster runs on dedicated servers, while the target cluster was spun up in AWS. I was getting the same connection timeout error and troubleshot it with the vbr logs (/tmp/vbr).

    The fix was to add firewall rules to my target AWS cluster to allow TCP connections over port 50000 from my source cluster. Hope this helps anyone who stumbles across this.
