Balancing Vertica for resiliency
Good day everyone,
We have a 3-node Vertica 8 cluster running on SLES 11 VMs.
During the backup process one node sometimes stops (I will open a separate thread on that topic).
My task is to find a way to ensure that clients can still connect to one of the running nodes when this happens. Native load balancing is not the solution for us, so I'd like to try an open source solution like HAProxy.
Has anyone tried this? I've read that someone is using a hardware load balancer: could you please share some details about the configuration used? For example, what is used to check the availability of a node (whether port 5433 is listening?)
That would be very helpful. Thank you in advance,
Alessandro
Comments
Follow-up: it turned out to be very easy to configure HAProxy in TCP mode for load balancing.
Now I "only" have to configure checks so that HAProxy can detect when a Vertica host is down.
The HAProxy config I've made is:
listen vertica-cluster
    bind 0.0.0.0:5433
    mode tcp
    balance roundrobin
    server vertica01 10.0.1.8:5433
    server vertica02 10.0.1.9:5433
    server vertica03 10.0.1.10:5433
This simple HAProxy config seems to cover the basic needs: it balances the load and avoids sending connections to nodes that are down.
frontend vertica
    bind 0.0.0.0:5433
    mode tcp
    option tcplog
    default_backend vertica_cluster

backend vertica_cluster
    mode tcp
    balance leastconn
    server vertica01 1.1.1.8:5433 check fall 1 rise 2
    server vertica02 1.1.1.9:5433 check fall 1 rise 2
    server vertica03 1.1.1.10:5433 check fall 1 rise 2
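
For reference: with no other check options, the "check" keyword on a server line makes HAProxy periodically try to open a TCP connection to that server. "fall 1" marks the server down after a single failed check, and "rise 2" brings it back after two consecutive successful ones. As a rough illustration only, the probe is equivalent to something like the Python sketch below. The function name is made up for this example, and this is not how HAProxy itself is implemented.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    This only proves that something is listening on the port. It says
    nothing about whether the database behind it can serve queries.
    """
    try:
        # create_connection performs the TCP handshake; the context
        # manager closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, host unreachable, timeout, and so on.
        return False
```

Calling port_accepts_connections("10.0.1.8", 5433) from the load balancer host gives a quick manual version of the same test.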
Just add something like
    timeout client 8h    # adjust to your needs
to avoid disconnections when a client is inactive (in TCP mode a matching "timeout server" is usually needed as well).
If this is just checking whether port 5433 is accepting connections, then also consider the scenario where a node is recovering. In that case the process is running and listening on port 5433, but if you try to connect you get a message saying the node is in recovery. This is at least true in 7.2 - I haven't tested it in 8.0. So you still want the node to stay out of the load balancer until it is truly up. One way to do this is a health check that sends a "select 1" query to each node to verify that it is responding to queries.
--Sharon
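
A minimal sketch of such a "select 1" check in Python, in case it helps someone. It makes several assumptions that are not from this thread: the vsql client is installed and on the PATH of the load balancer host, a dedicated account exists for the check (CHECK_USER and CHECK_PASSWORD are placeholders), and the script is called the way HAProxy calls an external check, with the server address and port as the third and fourth arguments. The function names are made up, and it has not been tested against Vertica 8.

```python
import subprocess

# Hypothetical placeholders: an account used only for health checks.
CHECK_USER = "healthcheck"
CHECK_PASSWORD = "change-me"


def node_answers_queries(host, port, user=CHECK_USER,
                         password=CHECK_PASSWORD, timeout=5.0):
    """Return True only if "select 1" run through vsql comes back as 1."""
    cmd = [
        "vsql",              # assumed to be on PATH
        "-h", host,
        "-p", str(port),
        "-U", user,
        "-w", password,
        "-t", "-A",          # tuples only, unaligned: output is just the value
        "-c", "select 1",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # vsql missing or not executable, or no answer within the timeout.
        return False
    # A node that refuses the session (for example while recovering)
    # will not print "1", so it is reported as unhealthy.
    return result.returncode == 0 and result.stdout.strip() == "1"


def main(argv):
    """Return 0 if the node is healthy and 1 otherwise."""
    # Assumed argument layout of an HAProxy external check:
    # <proxy_addr> <proxy_port> <server_addr> <server_port>
    if len(argv) < 5:
        return 1
    server_addr, server_port = argv[3], argv[4]
    return 0 if node_answers_queries(server_addr, server_port) else 1
```

Saved as a script that ends with sys.exit(main(sys.argv)) (plus an "import sys" at the top), this could in principle be hooked in through HAProxy's "option external-check" and "external-check command" directives. Check the HAProxy documentation for your version before relying on it, since external checks have to be enabled explicitly.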
Thank you for adding this info!
In my scenario we probably don't need such a health check, so I will probably not spend time trying it, but this may help someone else someday.
Alessandro