Node down in 18-node cluster
One of the nodes in our 18-node cluster is down due to some OS issues. We have identified the problem,
but until it is fixed we are running on a 17-node cluster with K-safety 1.
I have some follow-up questions:
1) Will there be data loss if the buddy node also goes down?
2) Does a rebalance make sense on 17 nodes without removing the down node, given that our high-load ETL jobs are getting stuck?
Best Answers
mosheg Vertica Employee Administrator
Q - When one node is down, is there a way to keep the other nodes functioning optimally?
A - Yes, open a support case and investigate the root cause of your performance issue.
There are several things support may suggest: check how your data is distributed,
upgrade to the latest relevant version (swapping partitions between tables could fail in old Vertica versions),
change the way you use temporary tables, use Eon Mode, etc.
Q - Is it possible to remove a down node?
A - Support can help you with this, and you can use a standby node in the cluster to replace a down node.
See: https://www.vertica.com/docs/10.1.x/HTML/Content/Authoring/SQLReferenceManual/Statements/ALTERNODE.htm
For instructions on how to replace a permanently down node, see: https://forum.vertica.com/discussion/241043/cannot-replace-a-permanently-down-node-update-vertica-fails-with-host-is-unreachable
Q - Will increasing from Ksafe 1 to Ksafe 2 keep the data within license compliance?
A - Yes, but in your case it is advisable to stay with Ksafe 1, because Ksafe 2 will increase the amount of I/O and the required disk space.
Increasing the number of projections does not influence license consumption.
You can create active standby nodes as explained here: https://vertica.com/docs/10.1.x/HTML/Content/Authoring/AdministratorsGuide/ManageNodes/CreatingAnActiveStandbyNode.htm
Q - Does removing a node from the cluster require downtime?
A - It does not require downtime.
See: https://forum.vertica.com/discussion/241043/cannot-replace-a-permanently-down-node-update-vertica-fails-with-host-is-unreachable
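On the Ksafe 1 versus Ksafe 2 point in this answer, a rough back-of-the-envelope sketch may help. It assumes the simplified view that each segmented projection is stored K + 1 times (the original plus its buddies); the function names are hypothetical, and real on-disk sizes depend on compression and on unsegmented projections, so treat the ratio as an estimate only.

```python
def stored_copies(k_safety: int) -> int:
    """Copies of each segment kept under the simplified K + 1 model (an assumption)."""
    return k_safety + 1


def storage_growth(k_from: int, k_to: int) -> float:
    """Approximate factor by which stored segmented data grows when K-safety changes."""
    return stored_copies(k_to) / stored_copies(k_from)


growth = storage_growth(1, 2)
print(f"Ksafe 1 -> Ksafe 2 stores roughly {growth:.2f}x the segmented data")
```

Under this model the extra copies cost disk space and I/O, while, as the answer states, license consumption is not affected by the number of projections.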
mosheg Vertica Employee Administrator
Thank you Joseph for your kind words.
Q - When you say check how your data is distributed, do you mean check segmentation, rebalance and data skew?
A - Yes.
Q - As per Sumeet's link we need to have a standby node ready in a different single-node cluster to tackle this situation,
however as per the doc we can also achieve the same by keeping one standby in our 18-node cluster, right?
A - You can reconfigure one of the nodes to act as a standby, but if your current cluster sizing satisfies your workload performance and disk-space needs, it is advisable to add a new node as the standby.
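To make the data skew check mentioned above a bit more concrete, here is a minimal sketch. It assumes you have already collected a row count per node for a table or projection by some means; the function name, the 1.2 threshold, and the sample numbers are all hypothetical, not Vertica defaults.

```python
from statistics import mean
from typing import Dict, List


def skewed_nodes(rows_per_node: Dict[str, int], threshold: float = 1.2) -> List[str]:
    """Return nodes holding more than `threshold` times the average row count."""
    avg = mean(rows_per_node.values())
    return sorted(node for node, rows in rows_per_node.items() if rows > threshold * avg)


# Hypothetical per-node row counts for one projection.
sample = {
    "node01": 1_000_000,
    "node02": 1_050_000,
    "node03": 2_400_000,
    "node04": 980_000,
}
print(skewed_nodes(sample))
```

A node that shows up here holds noticeably more rows than its peers, which usually points to the segmentation expression as the thing to review.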
Answers
Q - Will there be data loss if the buddy node also goes down?
A - Committed data will not be lost.
While node 18 is down, if the buddy node also goes down, the database is considered unsafe and shuts down automatically.
Q - Does a rebalance make sense on 17 nodes without removing the down node, given that our high-load ETL jobs are getting stuck?
A - No, it is advisable to create a standby node.
See: https://www.vertica.com/docs/10.1.x/HTML/Content/Authoring/AdministratorsGuide/ManageNodes/CreatingAnActiveStandbyNode.htm
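The buddy-node rule from the first answer above can be illustrated with a small model. This is only a sketch under a simplifying assumption: nodes are numbered 1..N in a ring and each node's segment has its single buddy copy on the next node. The actual buddy placement in a given cluster may differ, and the function name is hypothetical.

```python
from typing import Set


def is_k1_safe(total_nodes: int, down: Set[int]) -> bool:
    """Simplified K-safety 1 check: unsafe if a down node's assumed buddy is also down."""
    for node in down:
        buddy = node % total_nodes + 1  # next node in the ring; node N wraps to node 1
        if buddy in down:
            return False  # both copies of that segment are unavailable
    return True


print(is_k1_safe(18, {18}))      # only node 18 down
print(is_k1_safe(18, {18, 1}))   # node 18 and its assumed buddy down
print(is_k1_safe(18, {18, 9}))   # node 18 and an unrelated node down
```

In this model the database stays up as long as no segment has lost both of its copies, which matches the answer: losing the buddy of an already-down node makes the database unsafe, while committed data itself is not lost.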
Thanks @mosheg for clarifying the above questions.
I have some more questions, please help.
1) The node was down due to issues on the OS side. We have now managed to bring it back.
But we observed that while the node was down, high-load ETL jobs and other reporting were stuck on the other 17 nodes.
Is there any way, when one node is down and until it is resolved, to keep the other 17 nodes functioning optimally so that our business is not impacted?
2) Also, is it possible to remove a down node?
3) We have K-safety 1 and a 20 TB license. We use almost 18 TB of the allowed license.
I suppose increasing K-safety will not be possible if we want to keep the data within license compliance?
Also, one more question:
If in the future we suspect an OS problem on a node and want to remove it from the cluster and rebalance the data before it goes down,
will this process require downtime for all loads, ETL jobs, reporting and end users of the DB?
Thanks @mosheg, your suggestions and suggested links are going to make our life easy.
I am definitely going to open a support case for further evaluation.
Just 2 last questions:
a) When you say check how your data is distributed, do you mean check segmentation, rebalance and data skew? (And we use Vertica 9.1, I think it is good enough, mate.)
b) As per Sumeet's link we need to have a standby node ready in a different single-node cluster to tackle this situation,
however as per the doc we can also achieve the same by keeping one standby in our 18-node cluster, right?
Thanks @mosheg