K-safety==2 but not quite

Hello,

I set up my cluster (5 nodes, 6.1.2) with a K-safety of 2. All projections were created with K-safety 2 and the design was marked as K-safe 2, so everything looked good.

I then lost a node. With a K-safety of 2 I did not mind much, trusting that it would have minimal impact. It then turned out that the server was completely dead, so I wanted to remove it from the cluster for good, following the documentation, which says to lower the K-safety first.

Sadly, trying to mark the design as K-safe 1 fails because some projections, although marked as K-safe 2, are actually only K-safe 1 (see [1]).

Probably related to this, when running on 2 nodes only, I see this error for some queries:

  ERROR: Insufficient projections to answer query
  Detail: Path segment called for on node v_spil_dwh_node0005 can not be computed

I recently rebalanced the cluster to make sure that all projections were up to date.

So my questions are:
- How is this even possible?
- Is there something that can be done to avoid this?

This is now a huge production issue.

Extra information:

[1]: failing to lower a K-safe 2 design to 1
  spil_dwh=> SELECT MARK_DESIGN_KSAFE(1);
  -[ RECORD 1 ]-----+------------------------------------------------------------
  MARK_DESIGN_KSAFE | Current design does not meet the requirements for K = 1.
                      Current design is valid for K = 0.
                      Projection dsl.geoip_name has insufficient buddy projections; it has 0 buddies
                      Projection dsl.vw_gameplays_reportdate has insufficient buddy projections; it has 0 buddies
                      Projection dsl.vw_gameplays_username has insufficient buddy projections; it has 0 buddies

Offending projection:
  spil_dwh=> select export_objects('', ' dsl.geoip_name');
  -[ RECORD 1 ]--+------------------------------------------------------------
  export_objects |

  CREATE PROJECTION dsl.geoip_name /*+createtype(D)*/
  (
   id ENCODING DELTARANGE_COMP,
   begin_ip,
   end_ip,
   begin_num ENCODING DELTARANGE_COMP,
   end_num ENCODING DELTARANGE_COMP,
   country ENCODING RLE,
   name
  )
  AS
   SELECT geoip.id,
          geoip.begin_ip,
          geoip.end_ip,
          geoip.begin_num,
          geoip.end_num,
          geoip.country,
          geoip.name
   FROM dsl.geoip
   ORDER BY geoip.name
  SEGMENTED BY hash(geoip.name) ALL NODES OFFSET 2;

  SELECT MARK_DESIGN_KSAFE(2);

Comments

  • Options
    Dropping the projections does allow me to lower the K-safety of the design, so I should be able to carry on. I still wonder how this kind of thing could happen and how to avoid it.
  • Options
    Hi Guillaume,

    Was dsl.geoip_name created recently, or is it empty? Superprojections are created only after data is inserted.

    Did you also face the same issue after rebalancing the data?

    Thanks,
    Vivek
  • Options
    Hi Guillaume,

    Please note that a K-safety value of 2 is not recommended. Vertica 5.0 and later versions have an intelligent node loss feature that tracks critical nodes. As long as the database has a K-safety value of 1 and all the data is accessible somewhere, the cluster will stay up. K-safety = 2 introduces performance and space issues. It requires 3 copies of all the data instead of 2, so loads take longer. The additional data can also cause delays during the recovery of a down node, during rebalancing when adding or removing a node, during backups, etc. K-safety = 1 is the recommended K-safety value.
    We tell anyone who is running K-safety = 2 to lower it to K-safety = 1.
  • Options
    So, if I have a K-Safe=1 value, could I have a 4-node cluster and still have my cluster up and running with only 2 nodes?
  • Options
    No, but you actually can't do that with K=2 either.

    You could have a 5-node cluster and still have your K=1 cluster up and running with two nodes down, depending on which nodes failed.

    There's another rule to K-safety:  No matter how many nodes you have, the cluster will only stay up as long as *more than half* of the nodes are visible.

    This is to prevent cluster segmentation:  What if you had a 4-node cluster with two nodes each in two racks, then you cut the Ethernet cable running between the two racks?  Then in one rack you would have two nodes, which could stay up; and in the other rack you would have two nodes, which might also be able to stay up, depending on the exact organization of your data.  So basically you have two clusters.  What happens if you then modify a bunch of data in one cluster, and modify the same data differently in the other cluster, and then plug the two back together?  We would have no way to reconcile the differences.  So we don't let that happen -- if a cluster gets segmented and doesn't have a majority of nodes, even if all the data is available, it immediately shuts down to prevent cluster segmentation.

    This is, incidentally, why we recommend a minimum cluster size of 3 nodes.  A 2-node cluster can't run off of one node for the same reason a 4-node cluster can't safely run off of 2 nodes.
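
The two rules described above (a majority of nodes must be visible, and every piece of data must still have a live copy) can be illustrated with a small Python sketch. This is only a toy model, not how Vertica implements it: it assumes segments are laid out on a ring, with segment i stored on node i plus its next K neighbours, and the function names are made up for this example.

```python
from itertools import combinations


def has_quorum(total_nodes, up_nodes):
    """Majority rule: strictly more than half of the nodes must be visible."""
    return 2 * up_nodes > total_nodes


def data_available(total_nodes, k, down):
    """Assumed ring layout: segment i lives on nodes i, i+1, ..., i+k (mod n).

    Data is available if every segment has at least one copy on a live node.
    """
    down = set(down)
    for seg in range(total_nodes):
        holders = {(seg + offset) % total_nodes for offset in range(k + 1)}
        if holders <= down:  # every copy of this segment is on a down node
            return False
    return True


def cluster_stays_up(total_nodes, k, down):
    """Both conditions must hold: quorum and full data availability."""
    up_nodes = total_nodes - len(set(down))
    return has_quorum(total_nodes, up_nodes) and data_available(total_nodes, k, down)


def surviving_pairs(total_nodes, k):
    """All pairs of failed nodes that this toy model can survive."""
    return [pair for pair in combinations(range(total_nodes), 2)
            if cluster_stays_up(total_nodes, k, pair)]


if __name__ == "__main__":
    print(surviving_pairs(4, 1))  # 4 nodes, 2 down: never a majority
    print(surviving_pairs(5, 1))  # 5 nodes, K=1: depends on which two fail
```

In this model a 4-node cluster never survives two failures, whatever K is, because two nodes are not a majority. A 5-node cluster with K=1 survives two failures only when the failed nodes are not ring neighbours.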
  • Options
    To answer your original question about how this sort of thing can happen:  The K-safety setting is a forward-looking knob.  It affects all future tables and all future rebalances, but it does not automatically force all existing data to the new K-safety level.  This is because increasing K-safety requires populating a new projection; this is an expensive operation that you might want to defer to a maintenance window (while having new tables be K-safe immediately), and one that you might want to have some manual control over, designing new projections to match a particular query workload.
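
As a rough illustration of why one projection without buddies drags the whole design down, here is a minimal Python sketch. It assumes the effective K-safety of a design is limited by the projection with the fewest copies (copies minus one); that is a simplification for illustration, not Vertica's actual check. The projection `dsl.some_newer_projection` and its copy count are hypothetical; `dsl.geoip_name` has one copy because the output above reports 0 buddies.

```python
def effective_k(projection_copies):
    """Return the K-safety the design can actually deliver.

    projection_copies maps a projection name to its number of copies
    (the projection itself plus its buddies). Assumption: the design is
    only as safe as its least-replicated projection.
    """
    if not projection_copies:
        raise ValueError("no projections given")
    return min(copies - 1 for copies in projection_copies.values())


def lacking_buddies(projection_copies, k):
    """List the projections that have too few copies for the requested K."""
    return sorted(name for name, copies in projection_copies.items()
                  if copies - 1 < k)


if __name__ == "__main__":
    design = {
        "dsl.geoip_name": 1,            # 0 buddies, as reported in the thread
        "dsl.some_newer_projection": 3, # hypothetical, created after K was set to 2
    }
    print(effective_k(design))          # limited by the projection with no buddies
    print(lacking_buddies(design, 1))   # projections blocking K = 1
```

A design like this is marked K=2 going forward, yet a projection that predates the setting and was never given buddies keeps the deliverable level at 0 until it is rebuilt or dropped.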
