Vertica cluster (DC) to cluster (DR) replication, possibly using streaming technology
Hi Team,
I have a customer environment that is a 3-node cluster. They would like to set up a DR site that is an exact replica of production. I am looking for a replication solution from the production cluster to the DR setup so that there is no event loss.
AFAIK, I can use the vbr tool supported by Vertica to replicate the entire cluster. However, I am skeptical about the data loss part. At what frequency will it sync the data? Does it use some kind of Kafka-style streaming where we set an offset so we don't lose any data? I don't know how it works internally.
The other option I found in the Vertica docs is that we can use Kafka for data replication via the built-in Kafka package in Vertica. Can someone help me understand how it works?
Thanks in advance,
Regards,
SM
Comments
You said: "I am skeptical about the data loss part" - What data loss part? The Replication operation runs in a transaction so failures won't cause any data corruptions.
You said: "In what frequency will it sync the data" - That's up to you. Vbr replication is incremental. That is, the first replication copies all the data and subsequent replications copy only modified/ newly added data. You can create a cron tab job to run vbr on a set schedule.
Thanks for the quick response.
My only concern is the high data volume and velocity. The application is an event generator; the event rate can be around 5-10K EPS and can go much higher in both velocity and size. I can see a delta of about 3 GB in 5 minutes, and that has to be synced over a WAN link.
I believe the vbr tool uses rsync for replication, and I believe object-level replication is the best fit for this solution. I am just wondering how it checks the offset and avoids data duplication if a file is already synced and we try to restore it again.
It would be a great help if there is any best-practice documentation for cluster replication.
There is one slightly related question: since the customer works under compliance requirements, I would like to set the encrypt and checksum options to true. That will impact performance as well, right?
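That is, something like this in the [Transmission] section (this is my understanding of the option names; please correct me if I have them wrong):

```
[Transmission]
encrypt = True       # encrypt data in transit between the clusters
checksum = True      # verify the transferred data with checksums
```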
Do you see any other option for this case, or is this the best one? I was also checking another option where we can stream the data using the Kafka integration (vkconfig tool). Any advice?
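For reference, the kind of vkconfig setup I had in mind is roughly the following; the broker hosts, topic, schema, and table names are just placeholders on my side, and I have not tested this:

```
# sketch of a vkconfig-based loading pipeline on the DR cluster (placeholder names, untested)
vkconfig scheduler  --create --config-schema stream_cfg --conf scheduler.conf --frame-duration 00:00:10
vkconfig cluster    --create --config-schema stream_cfg --conf scheduler.conf --cluster kafka_dr --hosts broker01:9092,broker02:9092
vkconfig source     --create --config-schema stream_cfg --conf scheduler.conf --cluster kafka_dr --source events_topic --partitions 3
vkconfig target     --create --config-schema stream_cfg --conf scheduler.conf --target-schema public --target-table events
vkconfig load-spec  --create --config-schema stream_cfg --conf scheduler.conf --load-spec events_spec --parser kafkajsonparser
vkconfig microbatch --create --config-schema stream_cfg --conf scheduler.conf --microbatch events_mb \
    --target-schema public --target-table events --load-spec events_spec \
    --add-source events_topic --add-source-cluster kafka_dr
vkconfig launch --config-schema stream_cfg --conf scheduler.conf &
```

My understanding is that this would not replicate from production at all; the DR cluster would consume the same event stream from Kafka independently, and the offsets tracked by the scheduler are what prevent loss or duplication.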
We check the file size for file identity. So if a file is already synced and we try to sync it again, rsync lists the file first and checks whether the file size matches; if it matches, we skip the copy.
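Conceptually it is similar to a size-only rsync pass; this is only an illustration of the skip logic, not the exact command vbr generates internally, and the paths are placeholders:

```
# files whose size already matches on the destination are skipped; only new/changed files are copied
rsync --archive --size-only /vertica/proddb/ dr-host01:/vertica/proddb/
```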
Yes, encryption adds processing overhead.
How far behind can the DR cluster be allowed to fall? Would you like to sync daily or hourly?
Thanks for the comment. I am looking to sync every 10-15 minutes.
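Assuming the incremental run finishes well within that window, I was thinking of something along the lines of this crontab entry (the paths are placeholders, and it would need a guard so runs don't overlap):

```
# run the vbr replicate task every 15 minutes as dbadmin
*/15 * * * * /opt/vertica/bin/vbr --task replicate --config-file /home/dbadmin/replicate.ini >> /home/dbadmin/vbr_replicate.log 2>&1
```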