How to restore a full backup on a Kubernetes operator-based setup?
Synopsis
Hello,
Currently we're having a small issue. We have a first Kubernetes cluster that holds a Vertica setup using communal storage, and it works just fine. We now need to create a new setup based on the old data, so we did a full backup using the vbr.py script, which created all the objects in AWS object storage; so far so good. Our issue starts when trying to restore the data using the right bucket: the database needs to be shut down beforehand, at which point the readiness probe fails because of its exec command vsql -w $(cat /etc/podinfo/superuser-passwd) -c 'select 1'. We checked whether the operator exposes the readiness probe through its CustomResource (verticadbs), but it seems no such property exists. We also thought about changing the readiness probe manually, but that goes against what the operator stands for with its reconciliation loop.
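For context, this is roughly how that probe appears on the server container. A minimal sketch reconstructed from the exec command quoted above; the timing values are illustrative, not the operator's actual defaults:

# Sketch of the readiness probe on the Vertica server container,
# reconstructed from the exec command quoted above. periodSeconds and
# failureThreshold are illustrative, not the operator's defaults.
readinessProbe:
  exec:
    command:
      - bash
      - -c
      - vsql -w $(cat /etc/podinfo/superuser-passwd) -c 'select 1'
  periodSeconds: 10
  failureThreshold: 3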
My question is the following: how can we restore the database from a full backup using the vbr script on a Kubernetes-based setup?
References:
Vertica version: v12.0.3-0
Vertica Kubernetes operator: 1.10.2
Kubernetes version: 1.28.2
VerticaDB CR version: vertica.com/v1beta1
Thank you,
Answers
After reading the operator's source code I found a way to override the readiness probe. I'm still interested in knowing the best way to do a full restoration on a Kubernetes-based setup.
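For reference, a minimal sketch of what such an override could look like in the VerticaDB spec, assuming the spec.readinessProbeOverride field exposed by recent operator versions; verify the exact field name against your installed CRD before applying:

# Minimal sketch: relax the readiness probe through the VerticaDB CR so the
# check tolerates the database being intentionally down during the restore.
# Assumes spec.readinessProbeOverride exists in your operator version.
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: verticadb-sample        # placeholder name
spec:
  readinessProbeOverride:
    periodSeconds: 30
    failureThreshold: 120       # tolerate roughly an hour of downtime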
Actually, even though the readiness/startup probes are overridden, nothing changes. When I run
SELECT SHUTDOWN();
everything goes down and pods restart everywhere, which is normal. But then again, how can we do a restoration in such a case?
There is a restore process in the documentation at https://docs.vertica.com/12.0.x/en/containerized/backup-restore/#restore-from-a-backup that describes the process, including extending the livenessProbe.
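As far as the probes are concerned, the gist of that documented approach is to relax the liveness probe for the duration of the restore so kubelet does not restart the pods while the database is intentionally down. A rough sketch; the field name (livenessProbeOverride) and the values are illustrative, so follow the linked page for the exact settings:

# Sketch: extend the liveness probe during the restore window so the pods
# are not restarted while the database is shut down on purpose.
# Field name and values are illustrative; see the linked documentation.
spec:
  livenessProbeOverride:
    initialDelaySeconds: 30
    periodSeconds: 30
    failureThreshold: 120       # ~60 minutes before kubelet restarts the pod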
Thank you for the insight, it does indeed work and I did a full restoration, but the issue is the following command:
It fails all the time: the operator tries to run it on all nodes, but nothing gets applied because of the nodes' state, and it throws a rollback error at the end:
Is there something else we need to do?
Actually, I have a better understanding of what's going on: I think the cluster still needs access to the old IPs, and the fact that we can't run a
re_ip
is because it's calling the old IPs. Even though I ran this command to set the redirection, which seems to at least help start the process, it still fails for the same reason:
Re-IP should not be necessary with an Eon Mode database restore; it should happen automatically if the primary node count is correct, as we expect node IPs may change in an Eon Mode cluster. Does the database start anyway if you configure the resource to start from the restore location? If not, please open a ticket if possible, and I'll see if support can look at this during business hours.
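To make "start from the restore location" concrete, it would look roughly like the following in the CR. This is a sketch only; the bucket path, endpoint, secret name, and subcluster sizing are placeholders rather than a verified configuration:

# Sketch: revive an Eon Mode database from the communal location the restore
# wrote to. All names, paths, and sizes below are placeholders.
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: verticadb-restore
spec:
  initPolicy: Revive                        # revive from existing communal data
  communal:
    path: s3://restore-bucket/vertica-db    # placeholder restore location
    endpoint: https://s3.amazonaws.com
    credentialSecret: s3-creds              # placeholder secret name
  subclusters:
    - name: primary
      size: 3                               # should match the primary node count at backup time
      isPrimary: true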
If you mean changing the restore location to the one it's meant to be used with, I didn't change anything; I left it as the script did it, which is exactly the same as the original backup: /data for data and /depot for the depot. Just to make sure, I checked the configuration file
metadata/DATABASE_NAME/cluster_config.json
Also, I tried to create a support ticket here, but it seems I don't have enough rights. Can you help?
Thanks a lot,
I am not able to change settings on the support portal; please contact your AE or field team for help there.
Thanks, I already did that. I'm still waiting to get access to the support portal so I can create the ticket.
Are you running the re_ip manually, or is this what you see when the operator tries to run it? The output from the re_ip doesn't tell us much. We may need to look at the contents of /opt/vertica/log/adminTools.log to see why it failed.
No, it's what the operator is trying to do in an infinite loop, circling through all the nodes over and over. I attached the
adminTools.log
of one of the nodes; let me know if you need anything else.
What is the image that you currently have deployed in your cluster? Is it the same version of Vertica that the backup was taken from?
The adminTools.log had this error on each of the nodes: "*** Core dump before operational or shutting down". We will need to collect scrutinize output and investigate other logs, so it's best to handle this through a ticket with our support org.
Sorry for the late response. It's exactly the same image, nothing has changed. And yes, I already created a support ticket and am waiting for a response.
Thank you,
@domidunas: Could you please share the error you are experiencing?