Some time ago, we discussed backend pools and how to load-balance inside them using directors, remember? In the first post I hinted at forcing a backend to "sick" before maintenance, but didn't go into more detail. Today it's time to take a short but closer look at how you can cleanly take a backend out of rotation and put it back in.
The main aim
Let's say that we need to upgrade, or maybe restart, one of our backends. That operation will make the machine unavailable to us in case of a cache miss, so we need to work around that.
Maybe you are thinking "But, Guillaume, we have probes, surely Varnish will see that the backend is down and will send requests somewhere else." You would be right, and also wrong. It's true that probes will allow Varnish to detect that the server is unavailable, but remember that probing occurs out-of-band, periodically, which means that there may be some latency in the detection and we could send a request to an already-dead machine. We need something better.
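To make that concrete, here is a hypothetical probe and backend definition (the names, addresses and values are purely illustrative, not taken from any real setup):

vcl 4.0;

probe health_check {
    .url = "/";          # request sent to the backend on every probe
    .interval = 5s;      # one probe every 5 seconds
    .window = 8;         # look at the last 8 probes...
    .threshold = 6;      # ...and require 6 successes to count as healthy
}

backend backend1 {
    .host = "192.0.2.10";
    .port = "8080";
    .probe = health_check;
}

With values like these, several failed probes have to pile up before the backend flips to sick, and during that window requests can still be sent to a machine that is already gone.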
No vain claim
What is going to be proposed here is pretty standard, and by no means the only way to achieve it. Notably, you could drive the whole operation from the backend side, configuring the origin to fail its probes and using the trusty "ss" command-line tool to monitor connections. But this is a Varnish blog post, and we'll see a Varnish-centric version of the process.
The big bonus is that it'll work on any backend, and that if you are not in charge of the origin, the communication needed with the backend admin team is minimal.
First: refrain (or abstain)
The first thing we need to do is to tell Varnish to stop sending requests to the targeted machine, and maybe the first idea coming to your mind is to edit the VCL to comment out the backend definition in it. That will work, but it also means tracking down every use of the backend in the configuration, as Varnish doesn't like undefined symbols in its VCL.
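To illustrate, here's a simplified, hypothetical configuration where backend1 is registered in a round-robin director; your real VCL will obviously differ:

vcl 4.0;

import directors;

backend backend1 { .host = "192.0.2.10"; .port = "8080"; }
backend backend2 { .host = "192.0.2.11"; .port = "8080"; }

sub vcl_init {
    # build the pool used by the rest of the configuration
    new pool = directors.round_robin();
    pool.add_backend(backend1);
    pool.add_backend(backend2);
}

Commenting out the definition of backend1 would leave pool.add_backend(backend1) pointing at a symbol that no longer exists, so you'd have to chase down and edit every such reference too.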
There's an easier, quicker way, though: we can force the server state to "sick", making Varnish "believe" the server is down so it won't send new requests. For this, we need only one command:
varnishadm backend.set_health *.backend1 sick
The command is pretty much self-explanatory except for one thing: why are we globbing (using the "*") instead of just using "backend1", the name of the backend? To explain it, let's look at this:
varnish> backend.list *.backend1
200
Backend name               Admin     Probe                 Last updated
vcl-1510766009.backend1    probe     Healthy (no probe)    Wed, 15 Nov 2017 17:16:12 GMT
boot.backend1              probe     Sick 0/8              Wed, 15 Nov 2017 17:16:26 GMT
As you can see, the backend name is composed of two parts: the VCL name as a prefix, followed by the backend name from the VCL. This happens because in Varnish, backends are owned by the VCL, and multiple VCLs can be warm at the same time (i.e., their probes are running so you can quickly swap a new one in). In true Pokémon fashion, globbing allows us to catch them all.
Note: we can set a backend to sick no matter what; it may or may not have a probe, we don't care.
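If you're curious which VCLs are currently loaded (and therefore which prefixes to expect in front of your backend names), you can ask varnishadm for the list; the exact output format varies between Varnish versions:

varnishadm vcl.list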
Second: wait, drink a tisane
Now, no new requests will be issued to the origin, but we are not out of the woods yet because we may still have in-flight requests or responses on the wire: think of a slow request that was issued right before the server was set to sick. So we have to wait for these requests to complete if we don't want to disturb user traffic.
Varnish has a nice tool to monitor progress called varnishstat. It contains a lot of information about the global state of Varnish, useful both to regular users and to Varnish developers, but since we don't need everything, we are once again going to glob to filter only what we care about:
# varnishstat -1 -f '*.backend1.conn'
VBE.boot.backend1.conn                      0    .   Concurrent connections to backend
VBE.vcl-1510766009.backend1.conn            0    .   Concurrent connections to backend
This way we only get the connections open to backend1, from one VCL or another. Once all the lines reach 0, we are good to go. And since we are lazy (at least I know I am), here is a command line that will keep running until there are no more connections:
while ! varnishstat -1 -f '*.backend1.conn' | awk '{if ($2 != 0) exit 1}'; do sleep 1; done
It's quick and dirty, but it works so well it would be wrong not to use it.
Once all connections are down to 0, you can pick up the phone, call the backend team, and greenlight the intervention.
Third: wait again
Well, we need to wait for the maintenance operation to finish, don't we? Sit back, relax, and wait for the phone call telling you that the machine is ready to roll.
Fourth: regain health
Once the server is back in business, we have to add it back into the pool. It's really not that complicated; we just need to call:
varnishadm backend.set_health *.backend1 auto
Again, it's self-explanatory, but since we set the backend to "sick" in the first step, you may have been expecting a "healthy" here, no? It turns out that "healthy" is the exact opposite of "sick": Varnish would "believe" the server to be healthy no matter what the probe says, which is a bit heavy-handed. Notably, if the backend team called you too early (the machine has booted, but the service isn't accepting connections yet), you're in for a ride.
Setting the backend(s) to "auto" reverts to the initial state: healthy if there's no probe, and trusting the probe if there is one, which is really what we want. In that case, if you trigger the command too early, no problem: the probe will keep reporting a sick backend until it's actually back up.
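If you want to double-check, list the backend again; the Admin column should read "probe" (the auto state we just restored) rather than "sick":

varnishadm backend.list '*.backend1'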
And we are done!
In the same vein...
It took us a while to get there, but it's important to grasp the individual, simple Varnish elements used here, as that goes a long way toward integration and reusability. Notably, since all this really just uses varnishadm and varnishstat, you can, for example, use the varnish-agent to hook the process into Jenkins or Kubernetes.
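As a rough sketch of what such a hook could look like, here is a hypothetical drain script that simply strings together the commands we used above (the script name and argument handling are mine, adapt to taste):

#!/bin/sh
# drain-backend.sh <backend-name>
# Take a backend out of rotation and wait for its in-flight connections to finish.
BACKEND="$1"

# tell every loaded VCL to stop sending new requests to the backend
varnishadm backend.set_health "*.$BACKEND" sick

# wait until the concurrent-connection counters for that backend all reach 0
while ! varnishstat -1 -f "*.$BACKEND.conn" | awk '{if ($2 != 0) exit 1}'; do
    sleep 1
done

echo "$BACKEND is drained, maintenance can start"

Once the maintenance is over, a single varnishadm backend.set_health "*.$BACKEND" auto puts it back in rotation.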
Let me know what you think in the comments!