K-safety==2 but not quite
Hello,

I set up my cluster (5 nodes, 6.1.2) with a K-safety of 2. All projections were created with K-safety 2 and the design was marked as K-safe 2; all looked good. I then lost a node. With a K-safety of 2, I did not mind so much, trusting that it should have minimal impact. It then appeared that the server was fully dead, so I wanted to remove it from the cluster entirely, following the documentation, which says to first lower the K-safety. Sadly, trying to mark the design as K-safe 1 fails because some projections, although marked as K-safe 2, are actually only K-safe 1 (see [1]).

Probably related to this, when running on 2 nodes only, I do see these errors for some queries:

ERROR: Insufficient projections to answer query
Detail: Path segment called for on node v_spil_dwh_node0005 can not be computed

I recently rebalanced the cluster to make sure that all projections were up to date. So my questions are:

- How is this even possible?
- Is there something to be done to avoid this?

This is now a huge production issue.

Extra information:

[1]: failing to lower a K-safe 2 design to 1
spil_dwh=> SELECT MARK_DESIGN_KSAFE(1);
-[ RECORD 1 ]-----+------------------------------------------------------------
MARK_DESIGN_KSAFE | Current design does not meet the requirements for K = 1.
                  | Current design is valid for K = 0.
                  | Projection dsl.geoip_name has insufficient buddy projections; it has 0 buddies
                  | Projection dsl.vw_gameplays_reportdate has insufficient buddy projections; it has 0 buddies
                  | Projection dsl.vw_gameplays_username has insufficient buddy projections; it has 0 buddies

Offending projection:
spil_dwh=> select export_objects('', ' dsl.geoip_name');
-[ RECORD 1 ]--+------------------------------------------------------------
export_objects | CREATE PROJECTION dsl.geoip_name /*+createtype(D)*/
               | (
               |  id ENCODING DELTARANGE_COMP,
               |  begin_ip,
               |  end_ip,
               |  begin_num ENCODING DELTARANGE_COMP,
               |  end_num ENCODING DELTARANGE_COMP,
               |  country ENCODING RLE,
               |  name
               | )
               | AS
               |  SELECT geoip.id,
               |         geoip.begin_ip,
               |         geoip.end_ip,
               |         geoip.begin_num,
               |         geoip.end_num,
               |         geoip.country,
               |         geoip.name
               |  FROM dsl.geoip
               |  ORDER BY geoip.name
               |  SEGMENTED BY hash(geoip.name) ALL NODES OFFSET 2;
               |
               | SELECT MARK_DESIGN_KSAFE(2);

0
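For reference, one possible way out of this state, assuming the anchor table's data is still fully reachable somewhere on the cluster, is to recreate the projection with an explicit KSAFE clause so that Vertica generates the buddy projections itself, refresh it, and only then drop the buddy-less original. This is a sketch, not a verified procedure; the _k2 name suffix is made up:

CREATE PROJECTION dsl.geoip_name_k2
( id ENCODING DELTARANGE_COMP, begin_ip, end_ip,
  begin_num ENCODING DELTARANGE_COMP, end_num ENCODING DELTARANGE_COMP,
  country ENCODING RLE, name )
AS SELECT id, begin_ip, end_ip, begin_num, end_num, country, name
   FROM dsl.geoip
   ORDER BY name
   SEGMENTED BY hash(name) ALL NODES KSAFE 2;

SELECT START_REFRESH();          -- populate the new projection and its buddies
-- only after the refresh has completed:
DROP PROJECTION dsl.geoip_name;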
Comments
Was dsl.geoip_name created recently, or is the table empty? Superprojections are created only after data is inserted.
Did you face the same issue after rebalancing the data as well?
Thanks,
Vivek
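To check those points, something along these lines may help; v_monitor.system and GET_PROJECTIONS are standard catalog surfaces, but the exact column names can vary between versions, so treat this as a sketch:

-- Is the anchor table empty?
SELECT COUNT(*) FROM dsl.geoip;

-- What K-safety does the design claim vs. what the cluster can actually deliver?
SELECT designed_fault_tolerance, current_fault_tolerance FROM v_monitor.system;

-- Which projections (and buddies) exist for the table?
SELECT GET_PROJECTIONS('dsl.geoip');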
Please note that a K-safety value of 2 is not recommended. Vertica 5.0 and later versions have an intelligent node-loss feature that tracks critical nodes: as long as the database has a K-safety value of 1 and all the data is accessible somewhere, the cluster will stay up. K-safety = 2 introduces performance and space issues. It requires 3 copies instead of 2 copies of all the data, so loads take longer, and the additional data can cause delays during the recovery of a down node, during rebalancing when adding or removing a node, during backups, etc. K-safety = 1 is the recommended K-safety value.
We tell anyone who is running K-safety = 2 to lower it to K-safety = 1.
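Once every table has at least one buddy projection again, lowering the design is a single call. Note that the extra third copies created under K=2 are not dropped automatically, so space is only reclaimed once the superfluous buddy projections are removed by hand. A sketch, with the same caveat about v_monitor.system column names:

SELECT MARK_DESIGN_KSAFE(1);

-- verify what the cluster now reports
SELECT designed_fault_tolerance, current_fault_tolerance FROM v_monitor.system;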
You could have a 5-node cluster and still have your K=1 cluster up and running with two nodes down, depending on which nodes failed.
There's another rule to K-safety: No matter how many nodes you have, the cluster will only stay up as long as *more than half* of the nodes are visible.
This is to prevent cluster segmentation: What if you had a 4-node cluster with two nodes each in two racks, then you cut the Ethernet cable running between the two racks? Then in one rack you would have two nodes, which could stay up; and in the other rack you would have two nodes, which might also be able to stay up, depending on the exact organization of your data. So basically you have two clusters. What happens if you then modify a bunch of data in one cluster, and modify the same data differently in the other cluster, and then plug the two back together? We would have no way to reconcile the differences. So we don't let that happen -- if a cluster gets segmented and doesn't have a majority of nodes, even if all the data is available, it immediately shuts down to prevent cluster segmentation.
This is, incidentally, why we recommend a minimum cluster size of 3 nodes. A 2-node cluster can't run off of one node for the same reason a 4-node cluster can't safely run off of 2 nodes.
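To make the majority rule concrete: with N nodes, the cluster needs floor(N/2) + 1 of them visible, so a 5-node cluster survives at most 2 lost nodes and a 4-node cluster only 1, regardless of K-safety. A quick sanity-check query (v_catalog.nodes is a standard system table; the 'UP' node_state value is an assumption about your version):

SELECT COUNT(*)                                     AS total_nodes,
       FLOOR(COUNT(*) / 2) + 1                      AS quorum_needed,
       SUM(CASE WHEN node_state = 'UP' THEN 1 END)  AS nodes_up
FROM v_catalog.nodes;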