Network Issue
Hello Everyone,
We are running a 3-node cluster in Azure, and last Friday one of our nodes ran into a problem that caused downtime for our cluster. I looked at the logs and found these messages:
vertica.log
2017-05-26 11:38:43.844 AnalyzeRowCount:0x7f9048015dd0 <ERROR> @v_medicaid_node0001: {threadShim} VX001/3483: Got unexpected error code from spread: -18, The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
2017-05-26 11:38:43.868 Init Session:0x7f900403c120-a00000003c2cb3 <ERROR> @v_medicaid_node0001: 08006/4539: Received no response from v_medicaid_node0002, v_medicaid_node0003 in transaction bind
2017-05-26 11:38:43.869 ManageEpochs:0x7f9048016960 <ERROR> @v_medicaid_node0001: {threadShim} VX001/3483: Got unexpected error code from spread: -18, The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
2017-05-26 11:38:43.879 Init Session:0x7f90040462e0-a00000003c2cb5 <ERROR> @v_medicaid_node0001: 08006/4539: Received no response from v_medicaid_node0002, v_medicaid_node0003 in transaction bind
2017-05-26 11:38:43.892 Init Session:0x7f900403c120-a00000003c2cb6 <ERROR> @v_medicaid_node0001: V1003/6876: No nodes up!
spread.log
[Fri 26 May 2017 11:29:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:31:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:31:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:33:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:33:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:35:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:35:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:37:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:37:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:39:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:39:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:41:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:41:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:43:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:43:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:45:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:45:19] Pushed eviction timeout back 600.000000s
[Fri 26 May 2017 11:47:19] Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
[Fri 26 May 2017 11:47:19] Pushed eviction timeout back 600.000000s
It looks to me like a network issue, but I could not verify it 100%. Has anyone here experienced the same error before?
Comments
Hi Rsalayo,
Yes, we have faced a similar kind of problem in our system, where the "unexpected error code from spread" message was reported.
We did multiple reboots of the Vertica cluster, but none of our actions helped to solve the case. There are two related articles linked below which may help you understand this case better. In our case, the disks hosting Vertica were at 100% disk IO, so spread was not able to communicate with the other nodes, which caused the database to go into a shutdown/hanging state.
Try checking your OS logs and IO reports to see what is happening with Azure. This problem is likely related to your OS, or have your system admin look into your Azure setup.
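A rough sketch of the kind of checks I mean, using standard Linux tools (the time window below just matches your log timestamps; device names will differ on your VMs):

# Kernel and system messages around the outage (journalctl applies to systemd hosts)
sudo dmesg -T | grep -iE 'error|i/o|sdc'
sudo journalctl --since "2017-05-26 11:30" --until "2017-05-26 11:50"

# Live per-disk utilization; watch the %util column for the disk holding the Vertica data
iostat -dxm 5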
https://forum.vertica.com/discussion/209075/cant-initialize-vertica-in-3-nodes-cluster-after-1-machine-down
https://forum.vertica.com/discussion/238042/help-vertica-database-shuts-down-every-10-days
Regards,
Raghav Agrawal
Raghav, thanks for the input. I'll check the links that you provided.
Hi rsalayo,
Mind if I ask about the configuration of your nodes? First, did you use the marketplace deployment of Vertica, or did you build your own? If the latter, did you install the cluster using the point-to-point flag? Can all the nodes see one another (did an IP address change)?
Thanks,
-Chris
You may also place the VMs in an availability set! This will give you better consistency in the network.
Chris,
We are running Vertica Analytic Database v7.2.3-0 and did the installation ourselves; I believe only version 8 is available in the Azure Marketplace. I'll have to verify whether the point-to-point flag was used during the installation. There were no IP address changes, and I can ping each node.
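For anyone following along, a quick sanity check beyond plain ping might look like this, run on each node (the node names are placeholders, and the 4803/4804 spread ports are from my reading of the docs, so worth verifying against your own config):

# Basic reachability and latency from this node to its peers
for h in node02 node03; do ping -c 5 "$h"; done

# Confirm the spread daemon is listening locally (commonly 4803 TCP/UDP and 4804 UDP)
sudo ss -ltnup | grep -E '480[34]'

# Confirm the spread process itself is running
ps -ef | grep -i '[s]pread'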
Ckotsidimos,
Not sure what you meant by 'availability set', but I'll research that.
@rsalayo In Azure you can create an availability set and put the VMs in it. Check the Azure documentation.
The point-to-point flag is critical; I know from experience, as I have many Vertica installations in many clouds.
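For reference, a rough sketch of how to check and change this (the host IPs are placeholders, and you should confirm the exact flags against the 7.2.x install docs before running anything):

# On my installs the current mode is recorded in admintools.conf (pt2pt vs broadcast);
# the key name can differ between versions, so treat this only as a hint
grep -i controlmode /opt/vertica/config/admintools.conf

# Re-running the installer with --point-to-point switches spread from broadcast to
# direct point-to-point messaging; stop the database first, hosts are placeholders
sudo /opt/vertica/sbin/update_vertica \
     --hosts 10.0.0.4,10.0.0.5,10.0.0.6 \
     --point-to-point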
Hello Everyone,
Sorry for the late update. I can confirm that we are running point-to-point on our cluster. Also, spread logging was found to be enabled on this 3-node cluster. One of our senior DBAs pointed out that spread logging should be disabled unless advised by Vertica support. I also found that during the outage there was much higher disk utilization on the node that caused the issue than on the other nodes. At this point we are thinking of disabling spread logging on our cluster to resolve the issue.
Any thoughts?
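For what it's worth, the disk numbers for the outage window can be pulled from the sysstat history with something like the following, assuming sysstat was installed and collecting (the sa file path and times are examples matching the 26 May logs):

# Read archived sysstat data for 26 May and show per-device utilization for the outage window;
# watch the %util column for the device backing /verticadb (sdc on these nodes)
sar -d -p -f /var/log/sa/sa26 -s 11:30:00 -e 11:50:00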
In addition to the above, we have also been seeing this error in the scrutinize logs. I'm not exactly sure how it relates to our outage.
Can you check the available free disk space?
Here is the result of df -h
node01
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sda2        29G   12G    18G   41%  /
devtmpfs         56G     0    56G    0%  /dev
tmpfs            56G     0    56G    0%  /dev/shm
tmpfs            56G   57M    56G    1%  /run
tmpfs            56G     0    56G    0%  /sys/fs/cgroup
/dev/sda1       497M  114M   384M   23%  /boot
/dev/sdc1      1007G   97G   859G   11%  /verticadb
tmpfs            12G     0    12G    0%  /run/user/1002
/dev/sdb1       221G  2.1G   208G    1%  /mnt/resource
node02
/dev/sda2        29G  9.0G    20G   32%  /
devtmpfs         56G     0    56G    0%  /dev
tmpfs            56G     0    56G    0%  /dev/shm
tmpfs            56G   49M    56G    1%  /run
tmpfs            56G     0    56G    0%  /sys/fs/cgroup
/dev/sda1       497M   76M   422M   16%  /boot
/dev/sdc1      1007G   23G   934G    3%  /verticadb
tmpfs            12G     0    12G    0%  /run/user/1002
/dev/sdb1       221G  2.1G   208G    1%  /mnt/resource
node03
/dev/sda2        29G  9.0G    20G   32%  /
devtmpfs         56G     0    56G    0%  /dev
tmpfs            56G     0    56G    0%  /dev/shm
tmpfs            56G   49M    56G    1%  /run
tmpfs            56G     0    56G    0%  /sys/fs/cgroup
/dev/sda1       497M   76M   422M   16%  /boot
/dev/sdc1      1007G   23G   934G    3%  /verticadb
tmpfs            12G     0    12G    0%  /run/user/1002
/dev/sdb1       221G  2.1G   208G    1%  /mnt/resource
We seem to have more than enough free space.
Seems like everything is OK. If it stays a one-time incident, I would say this was caused by a network timeout.