How to restore a full backup on a Kubernetes operator-based setup?
Synopsis
Hello,
Currently we're having a small issue. We have a first Kubernetes cluster that holds a Vertica setup using communal storage, and it works just fine. We now need to create a new setup based on the old data, so we did a full backup using the vbr.py script, which created all the objects in AWS object storage; so far so good. Our issue starts when trying to restore the data using the right bucket: the database needs to be shut down beforehand, at which point the readiness probe fails because of its exec command vsql -w $(cat /etc/podinfo/superuser-passwd) -c 'select 1'. We checked whether the operator exposes the readiness probe through its CustomResource (verticadbs), but it seems no such property exists. We also thought about changing the readiness probe manually, but that goes against what the operator stands for with its reconciliation loop.
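For context, this is roughly how that probe appears on the server container. A minimal sketch reconstructed from the exec command quoted above; the timing values are illustrative, not the operator's actual defaults:

# Sketch of the readiness probe on the Vertica server container,
# reconstructed from the exec command quoted above. periodSeconds and
# failureThreshold are illustrative, not the operator's defaults.
readinessProbe:
  exec:
    command:
      - bash
      - -c
      - vsql -w $(cat /etc/podinfo/superuser-passwd) -c 'select 1'
  periodSeconds: 10
  failureThreshold: 3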
My question is the following: how can we restore the database from a full backup using the vbr script on a Kubernetes-based setup?
References:
Vertica version: v12.0.3-0
Vertica Kubernetes operator: 1.10.2
Kubernetes version: 1.28.2
VerticaDB CR version: vertica.com/v1beta1
Thank you,
Answers
After reading the operator's source code I found a way to override the readiness probe. I'm still interested in knowing the best way to do a full restoration on a Kubernetes-based setup.
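For reference, a minimal sketch of what such an override could look like in the VerticaDB spec, assuming the spec.readinessProbeOverride field exposed by recent operator versions; verify the exact field name against your installed CRD before applying:

# Minimal sketch: relax the readiness probe through the VerticaDB CR so the
# check tolerates the database being intentionally down during the restore.
# Assumes spec.readinessProbeOverride exists in your operator version.
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: verticadb-sample        # placeholder name
spec:
  readinessProbeOverride:
    periodSeconds: 30
    failureThreshold: 120       # tolerate roughly an hour of downtime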
Actually, even though the readiness/startup probes are overridden, nothing changes. When I run
SELECT SHUTDOWN();
everything goes down and pods restart everywhere, which is normal. But then again, how can we do a restoration in such a case?
There is a restore process in the documentation at https://docs.vertica.com/12.0.x/en/containerized/backup-restore/#restore-from-a-backup that describes the process, including extending the livenessProbe.
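As far as the probes are concerned, the gist of that documented approach is to relax the liveness probe for the duration of the restore so kubelet does not restart the pods while the database is intentionally down. A rough sketch; the field name (livenessProbeOverride) and the values are illustrative, so follow the linked page for the exact settings:

# Sketch: extend the liveness probe during the restore window so the pods
# are not restarted while the database is shut down on purpose.
# Field name and values are illustrative; see the linked documentation.
spec:
  livenessProbeOverride:
    initialDelaySeconds: 30
    periodSeconds: 30
    failureThreshold: 120       # ~60 minutes before kubelet restarts the pod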
Thank you for the insight, it does indeed work and I did a full restoration, but the issue is the following command:
It fails all the time: the operator tries to run it on all nodes, but nothing gets applied because of the nodes' state, and it throws a rollback error at the end:
Is there something else we need to do?
Actually, I have a better understanding of what's going on: I think the cluster still needs access to the old IPs, and the fact that we can't run a
re_ip
is because it's calling the old IPs. Even though I ran this command to set the redirection, which seems to at least help start the process, it still fails for the same reason:
Re-IP should not be necessary with an Eon Mode database restore; it should happen automatically if the primary node count is correct, as we expect node IPs may change in an Eon Mode cluster. Does the database start anyway if you configure the resource to start from the restore location? If not, please open a ticket if possible, and I'll see if support can look at this during business hours.
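To make "start from the restore location" concrete, it would look roughly like the following in the CR. This is a sketch only; the bucket path, endpoint, secret name, and subcluster sizing are placeholders rather than a verified configuration:

# Sketch: revive an Eon Mode database from the communal location the restore
# wrote to. All names, paths, and sizes below are placeholders.
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: verticadb-restore
spec:
  initPolicy: Revive                        # revive from existing communal data
  communal:
    path: s3://restore-bucket/vertica-db    # placeholder restore location
    endpoint: https://s3.amazonaws.com
    credentialSecret: s3-creds              # placeholder secret name
  subclusters:
    - name: primary
      size: 3                               # should match the primary node count at backup time
      isPrimary: true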
If you mean changing the restore location to the one it's meant to be used with, I didn't change anything; I left it as the script did it, which is exactly the same as the original backup: /data for data and /depot for the depot. Just to make sure, I checked the configuration file
metadata/DATABASE_NAME/cluster_config.json
Also, I tried to create a support ticket here, but it seems I don't have enough rights. Can you help?
Thanks a lot,
I am not able to change settings on the support portal; please contact your AE or field team for help there.
Thanks, I already did that. I'm still waiting to get access to the support portal so I can create the ticket.
Are you running the re_ip manually, or is this what you see when the operator tries to run it? The output from the re_ip doesn't tell us much. We may need to look at the contents of /opt/vertica/log/adminTools.log to see why it failed.
No, it's what the operator is trying to do in an infinite loop, circling through all the nodes over and over. I attached the
adminTools.log
of one of the nodes; let me know if you need anything else.
What is the image that you currently have deployed in your cluster? Is it the same version of Vertica that the backup was taken from?
The adminTools.log had this error on each of the nodes: "*** Core dump before operational or shutting down". We will need to collect scrutinize output and investigate other logs, so it's best to handle this through a ticket with our support org.
Sorry for the late response. It's exactly the same image, nothing has changed. And yes, I already created a support ticket and am waiting for a response.
Thank you,
@domidunas: Could you please share the error you are experiencing?