How to restore a full backup on a Kubernetes operator-based setup?

edited January 6 in General Discussion

Synopsis

Hello,
Currently we're facing a small issue. We have a first Kubernetes cluster holding a Vertica setup that works just fine on communal storage, and we need to create a new setup based on the old data. So we took a full backup using the vbr.py script, which created all the objects in AWS object storage; so far so good. Our issue starts when trying to restore data from the right bucket: the database needs to be shut down beforehand, at which point the readiness probe fails because of its exec command vsql -w $(cat /etc/podinfo/superuser-passwd) -c 'select 1'. We checked whether the operator exposes the readiness probe through its verticadbs custom resource, but no such property seems to exist. We also thought about changing the readiness probe manually, but that goes against what the operator stands for: its reconciliation loop would revert the change.

My question is the following: how can we restore the database from a full backup using the vbr script on a Kubernetes-based setup?

References:

Vertica version: v12.0.3-0
Vertica Kubernetes operator: 1.10.2
Kubernetes version: 1.28.2
VerticaDB CR version: vertica.com/v1beta1

Thank you,

Answers

  • After reading the operator's source code I found a way to override the readiness probe (a sketch follows below), but I'm still interested in knowing the best way to do a full restore on a Kubernetes-based setup.
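
    A minimal sketch of such an override, assuming the v1beta1 VerticaDB spec in this operator version exposes the probe override fields (verify readinessProbeOverride and startupProbeOverride against the CRD actually installed; all values below are placeholders):

    spec:
      readinessProbeOverride:
        exec:
          command: ["bash", "-c", "true"]  # replaces the vsql check so pods stay Ready while the DB is down
        periodSeconds: 30
      startupProbeOverride:
        failureThreshold: 60               # allow a long window while the restore runs
        periodSeconds: 10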

  • Actually, even with the readiness/startup probes overridden nothing changes: when I run SELECT SHUTDOWN(); everything goes down and pods restart everywhere, which is normal. But then again, how can we do a restore in that case?

  • Bryan_H Vertica Employee Administrator

    The documentation at https://docs.vertica.com/12.0.x/en/containerized/backup-restore/#restore-from-a-backup describes the restore process, including extending the livenessProbe.
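
    For context, a hedged sketch of what extending the livenessProbe can look like through the CR override (field name assumed from the v1beta1 spec; take the exact values from the linked page):

    spec:
      livenessProbeOverride:
        failureThreshold: 30   # tolerate a long outage while vbr restores
        periodSeconds: 60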

  • edited January 7

    Thank you for the insight; it does indeed work, and I completed a full restore. The problem is the following command:

    $ /opt/vertica/bin/admintools -t re_ip --file=/opt/vertica/config/ipMap.txt --noprompt --force 
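
    For reference, the mapping file is one old/new address pair per node; to the best of my understanding the basic format is the old address followed by the new one, as sketched here (addresses are made up; check the re_ip documentation for your version for the exact format):

    # ipMap.txt -- hypothetical contents
    10.244.1.10 10.244.2.20
    10.244.1.11 10.244.2.21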
    

    It fails every time: the operator tries to run it on all nodes, but nothing gets applied because of the nodes' state, and it throws a rollback error at the end:

    Writing new settings to the catalogs of database XXXX ...
    The catalog change was not applied to the following nodes:
        v_vertdb_node0001: PrepareFailed
        v_vertdb_node0004: PrepareFailed
        v_vertdb_node0006: PrepareFailed
        v_vertdb_node0002: PrepareFailed
        v_vertdb_node0010: PrepareFailed
        v_vertdb_node0011: PrepareFailed
        v_vertdb_node0012: PrepareFailed
        v_vertdb_node0003: PrepareFailed
        v_vertdb_node0008: PrepareFailed
        v_vertdb_node0007: PrepareFailed
        v_vertdb_node0005: PrepareFailed
        v_vertdb_node0009: PrepareFailed
    Failed. Catalog change not committed on a quorum of nodes or there are nodes in unsafe states.
    
    All databases changes failed. Rolling back all the changes ...
    Rolledback.
    

    Is there something else we need to do?

  • Actually, I now have a better understanding of what's going on: I think the cluster still needs access to the old IPs, and the reason we can't run re_ip is that it keeps calling the old IPs. Even though I ran this command to set up the redirection:
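
    # Hypothetical annotation of the rule below: DNAT outbound packets addressed
    # to a node's old IP so they reach its new IP (OLD_IP / NEW_IP are placeholders):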

    $ iptables -t nat -A OUTPUT -d OLD_IP -j DNAT --to-destination NEW_IP
    

    That seems to at least let the process start, but it still fails for the same reason:

    Writing new settings to the catalogs of database XXXX ...
    The catalog change was not applied to the following nodes:
        v_vertdb_node0001: PrepareFailed
        v_vertdb_node0004: PrepareFailed
        v_vertdb_node0006: PrepareFailed
        v_vertdb_node0002: PrepareFailed
        v_vertdb_node0010: PrepareFailed
        v_vertdb_node0011: PrepareFailed
        v_vertdb_node0012: PrepareFailed
        v_vertdb_node0003: PrepareFailed
        v_vertdb_node0008: PrepareFailed
        v_vertdb_node0007: PrepareFailed
        v_vertdb_node0005: PrepareFailed
        v_vertdb_node0009: PrepareFailed
    Failed. Catalog change not committed on a quorum of nodes or there are nodes in unsafe states.
    
    All databases changes failed. Rolling back all the changes ...
    Rolledback.
    No changes applied.

  • Bryan_H Vertica Employee Administrator

    Re-IP should not be necessary with an Eon mode database restore; it should happen automatically if the primary node count is correct, since we expect node IPs to change in an Eon mode cluster. Does the database start anyway if you configure the resource to start from the restore location? If not, please open a ticket if possible, and I'll see if support can look at this during business hours.
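
    A hedged sketch of reviving from the restore location through the CR (initPolicy: Revive starts the database from existing communal data; the names and bucket path below are placeholders):

    apiVersion: vertica.com/v1beta1
    kind: VerticaDB
    metadata:
      name: vertdb                          # placeholder
    spec:
      initPolicy: Revive
      dbName: vertdb
      communal:
        path: s3://my-bucket/restore-path   # placeholder: the restored communal location
        endpoint: https://s3.amazonaws.com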

  • If you mean changing the restore location to the one it's meant to use: I didn't change anything and left it as the script set it, which is exactly the same as the original backup (/data for data, /depot for the depot). Just to make sure, I checked the configuration file metadata/DATABASE_NAME/cluster_config.json:

    $ jq '.Node[0]' cluster_config.json
          {
             "address" : "XX.XX.XX.XX",
             "addressFamily" : "ipv4",
             "catalogPath" : "/data/vertdb/v_vertdb_node0007_catalog/Catalog",
             "clientPort" : 5433,
             "controlAddress" : "XX.XX.XX.XX",
             "controlAddressFamily" : "ipv4",
             "controlBroadcast" : "XX.XX.XX.XX",
             "controlNode" : 45035996512324596,
             "controlPort" : 4803,
             "ei_address" : 0,
             "hasCatalog" : false,
             "isEphemeral" : false,
             "isPrimary" : true,
             "isRecoveryClerk" : false,
             "name" : "v_vertdb_node0007",
             "nodeParamMap" : [],
             "nodeType" : 0,
             "oid" : 45035996512324596,
             "parentFaultGroupId" : 45035996273704980,
             "replacedNode" : 0,
             "schema" : 0,
             "siteUniqueID" : 16,
             "tag" : 0
          }
    

    I also tried to create a support ticket here, but it seems I don't have enough rights. Can you help?

    Thanks a lot,

  • Bryan_H Vertica Employee Administrator

    I am not able to change settings on the support portal; please contact your AE or field team for help there.

  • Thanks, I already did that; I'm still waiting for access to the support portal so I can create the ticket.

  • spilchen Employee

    Are you running the re_ip manually, or is this what you see when the operator tries to run it? The output from re_ip doesn't tell us much. We may need to look at the contents of /opt/vertica/log/adminTools.log to see why it failed.
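
    A hedged way to pull that log from a running pod (pod name is a placeholder; the operator typically names the Vertica container "server"):

    $ kubectl exec vertdb-sc0-0 -c server -- cat /opt/vertica/log/adminTools.log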

  • No, it's what the operator is trying to do in an infinite loop, cycling through all nodes over and over. I attached the adminTools.log of one of the nodes; let me know if you need anything else.

  • spilchen Employee

    What is the image that you currently have deployed in your cluster? Is it the same version of Vertica that the backup was taken from?

    The adminTools.log had this error on each of the nodes: "*** Core dump before operational or shutting down". We will need to collect scrutinize and investigate other logs, so it's best to handle this through a ticket with our support org.
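
    A hedged sketch of kicking off scrutinize in one pod (pod name is a placeholder; scrutinize prints the location of the resulting tarball, which can then be copied out with kubectl cp):

    $ kubectl exec vertdb-sc0-0 -c server -- /opt/vertica/bin/scrutinize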

  • Sorry for the late response. It's exactly the same image; nothing has changed. And yes, I already created a support ticket and am waiting for a response.

    Thank you,

  • SruthiA Administrator

    @domidunas: Could you please share the error you are experiencing?
