Replication failing for a bunch of objects/schemas
Vertica 9.2.1
We run object-level and schema-level replication under the same job. The job pulls the list of objects and schemas to replicate for the current week, and based on that list the replication script is triggered for each object/schema one by one. For most of the objects and schemas replication completes successfully, but it is failing for a bunch of objects/schemas with the errors below, which I have reported earlier.
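For context, the driver loop looks roughly like this (a minimal sketch only, assuming the weekly list is a plain-text file with one object/schema per line and that a per-object vbr config is generated from a template; the paths, the dbadmin setup and the {OBJECTS} placeholder are illustrative, not our actual script):

#!/usr/bin/env python3
# Sketch of the weekly replication driver described above (assumptions noted in the post).
import subprocess
from pathlib import Path

OBJECT_LIST = Path("/opt/replication/current_week_objects.txt")    # hypothetical path
CONFIG_TEMPLATE = Path("/opt/replication/replicate_template.ini")  # hypothetical template

def build_config(obj_name: str) -> Path:
    # Write a per-object config by substituting the object name into the template.
    cfg = Path(f"/tmp/replicate_{obj_name}.ini")
    cfg.write_text(CONFIG_TEMPLATE.read_text().replace("{OBJECTS}", obj_name))
    return cfg

def main() -> None:
    failures = []
    for obj in OBJECT_LIST.read_text().split():
        cfg = build_config(obj)
        # vbr --task replicate is the standard entry point for object/schema replication.
        result = subprocess.run(["vbr", "--task", "replicate", "--config-file", str(cfg)])
        if result.returncode != 0:
            failures.append(obj)  # keep going; most objects succeed, only a few fail
    if failures:
        print("Replication failed for:", ", ".join(failures))

if __name__ == "__main__":
    main()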
I have looked at the solutions suggested on the web when I searched for the "connection reset by peer" error. As I said, our replication job is failing only for some of the objects/schemas and not for all, so I can't kill the rsync daemon on all 56 nodes and rerun the job every time; if I did that it would be a never-ending process. I have also noticed that vbr checks for the rsync daemon before triggering the replication: if the rsync daemon is already running, it reports that the rsync daemon is already running and triggers the replication without any issues.
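Instead of restarting the daemon everywhere, I only want to confirm its state per node, roughly like this (a minimal sketch, assuming passwordless SSH as dbadmin and that the daemon appears as "rsync --daemon" in the process list; the node list below is a placeholder, not our real inventory):

#!/usr/bin/env python3
# Sketch: check (rather than blindly restart) the rsync daemon on every node.
import subprocess

NODES = ["130.6.145.%d" % i for i in range(1, 57)]  # hypothetical 56-node list

def rsync_daemon_running(node: str) -> bool:
    # Return True if an rsync daemon process is found on the node.
    check = subprocess.run(
        ["ssh", f"dbadmin@{node}", "pgrep -f 'rsync --daemon'"],
        capture_output=True,
    )
    return check.returncode == 0

for node in NODES:
    state = "running" if rsync_daemon_running(node) else "NOT running"
    print(f"{node}: rsync daemon {state}")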
Error: On host 172.29.70.74: Error accessing remote storage: failed accessing remote storage on 130.6.145.5: rsync: read error: Connection reset by peer (104) rsync error: error in rsync protocol data stream (code 12) at io.c(760) [Receiver=3.0.7]
How do we prevent the below errors related to .gt files?
Error: On host 172.29.70.89: Error accessing remote storage: failed accessing remote storage on 130.6.145.22: rsync: link_stat "/data/VDW/v_vdw_node0022_data/143/0242a9698186283c174b7c8ccbd7682001300003fc8a9def_0.gt" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
Error: On host 172.29.70.133: Error accessing remote storage: failed accessing remote storage on 130.6.145.53: file has vanished: "/data/VDW/v_vdw_node0053_data/577/02bf9a9472aeee340c4eb31174ea48b702c00003fc3ec801_0.gt"
The odd thing here is that this configuration and process have been in use by them for a long time, and we have ruled out permissions, connectivity, and a nonexistent location as possible causes.
We have enough space on the target cluster, so I don't think it's a space issue either.
Answers
@Joseph, is the remote file system being unmounted during the operation? It could be an NFS issue. Check that the remote path from the log still exists while the job executes. Hope this helps you troubleshoot the issue further.
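Something as simple as the following, run alongside the replication, can show whether the directory disappears mid-transfer (a minimal sketch, assuming passwordless SSH as dbadmin; the host and path are taken from one of the errors above, and the polling interval and duration are arbitrary):

#!/usr/bin/env python3
# Sketch of the suggested check: poll the remote path while vbr runs.
import subprocess, time

NODE = "130.6.145.22"                        # host from the link_stat error above
PATH = "/data/VDW/v_vdw_node0022_data/143"   # parent directory of the missing .gt file

for _ in range(60):  # poll roughly every 10 seconds for ~10 minutes
    ok = subprocess.run(
        ["ssh", f"dbadmin@{NODE}", "test", "-d", PATH],
        capture_output=True,
    ).returncode == 0
    status = "present" if ok else "MISSING (possible unmount)"
    print(f"{time.strftime('%H:%M:%S')} {NODE}:{PATH} {status}")
    time.sleep(10)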