October 23, 2018
7 min read time

Howto: Respond to probes


Two years ago, I wrote an article about how probes work in Varnish (it's a great article: fun, informative... go read it). It covers a lot of ground, but it still misses one important spot: it only looks at how Varnish uses probes to know whether a backend is worth contacting. So today, we are going to look at the other side of the story: how do we tell the rest of the system that Varnish is up and ready to work?

Also, we'll see how to handle maintenance: if you need to take your Varnish node offline, it's annoying to log into all the load balancers to re-configure them; it's easier to just tell Varnish to fail incoming probes until said load balancers take the node out of their pool, then wait for the active connections to end (does it ring a bell?) and stop Varnish.

Hop on! We'll have a look at different ways of doing it, good and (mostly) bad, to understand what works and be warned of the various pitfalls to avoid.

The irrelevant one

Let's get this one out of the way first, as it's not very interesting, practical, or even making use of any Varnish-specific capability: TCP probing. Basically, Varnish just needs to accept connections to be deemed healthy. In itself, it's a fairly bad test, but sometimes this is all you get, so...
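For reference, such a probe often boils down to just checking that the port accepts a connection; a minimal sketch, assuming the node is reachable as varnish-host on port 80 (both are placeholders):

# exit code 0 if a TCP connection succeeds within 1 second
nc -z -w 1 varnish-host 80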

The problem is that our lovely Varnish Configuration Language (VCL), as powerful as it is, only kicks in once we have an HTTP request. Varnish will automatically accept connections, parse HTTP requests (and refuse malformed ones), without us doing anything. And of course, we only want to refuse probe connections, not legitimate requests.

The easy solution is iptables: have Varnish listen on a secondary port (add a "-a :$PROBING_PORT" to the varnishd command line), and point the probes at it. When you need to fail probes, add a REJECT rule on that port.
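As a sketch, with 8181 as a made-up probing port:

# varnishd serving traffic on :80 and probes on a dedicated port
varnishd -a :80 -a :8181 -f /etc/varnish/default.vcl

# fail TCP probes: reject new connections on the probe port
iptables -I INPUT -p tcp --dport 8181 -j REJECT

# resume answering probes: remove the rule again
iptables -D INPUT -p tcp --dport 8181 -j REJECT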

There, let's move to more fun and Varnish-specific solutions.

The local one

If you are on the Varnish node, you can actually just use varnishadm to ping varnishd:

varnishadm -t 1 ping

The "-t 1" reduces the timeout parameter from the default five seconds down to just one, because I'm a bit impatient, and really, that command should return instantaneously anyway.

It's pretty cool because it's super simple and you don't need to know what port varnishd is running on. The caveat is that you can't really tell Varnish to fail this probe, so we have to introduce a flag file: a simple empty file whose presence will act as a marker. If we name it /etc/varnish/answer_probes, then our test can become:

test -e /etc/varnish/answer_probes && varnishadm -t 1 ping

Remove the file to fail the probe, create it to come back up. This is a bit sketchy here, mainly because the test needs to be aware of the flag file, but the idea has its merits, as we will see very soon.

The good one

Now, and for the rest of the article, let's assume we need to answer HTTP probes with a 200 response to signal up-and-ready-to-work-ness (yes, that is an actual word). The go-to snippet is of course this one:

sub vcl_recv {
	if (req.url == "/varnish_check") {
		return (synth(200));
	}
}
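From the probing side, checking it is easy enough (host and port are assumptions here, adjust to your setup):

# prints the HTTP status code of the probe response; expect 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost/varnish_check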

That's simple, straightforward and so dumb it can't fail, right? Right, and that's actually the issue here: it can't fail! If we want to stop responding, we have to change the VCL to something like:

sub vcl_recv {
	if (req.url == "/varnish_check") {
		return (synth(404));
	}
}

Which is equally readable and all, but by modifying the configuration, we put that server at odds with the rest of the cluster (of course, that's exactly what you want here, if you think about it).

That may be an annoyance if your cluster is managed as a single unit: just imagine that 23 seconds after your change, the Jenkins QA job succeeds, triggering an Ansible push to the whole cluster. The VCL changed under you and varnishd is responding to probes again - darn!

Thankfully, there's a VMOD for that, or rather, vmod_std has a function for that, with a very explicit name:

import std;

sub vcl_recv {
	if (req.url == "/varnish_check") {
		if (std.file_exists("/etc/varnish/answer_probes")) {
			return (synth(200));
		} else {
			return (synth(404));
		}
	}
}

Again with the flag file! That way, the VCL can stay the same and you can just rm/touch the flag file to take the machine out or put it back in! This is a pretty basic idea, but it does the job perfectly, and be warned, we'll reuse it in all the other solutions.
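In practice, taking a node out for maintenance then boils down to (same path as in the snippet above):

# take the node out: probes start failing, load balancers drain the node
rm /etc/varnish/answer_probes

# wait for active connections to end, stop varnish, do your thing, then
# put the node back in rotation
touch /etc/varnish/answer_probes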

The dangerous one

To be very clear: you DO NOT want to use this one! It's a bad boy: you think it's the one you want because it brings you that little something extra, but it'll leave you crying in the corner after it betrays you.

One limitation of the previous proposal is that it only tells you about the node's status, not about the service: if all the backends behind Varnish are dead, there's no point in asking it. It's not my preferred course of action (I'll explain later in this post), but you may have very valid reasons for linking Varnish's status with that of its backends.

So, you install the probe file on each backend, or even better, you use something that tests the backend thoroughly and use the VCL to adapt the URL:

import std;

sub vcl_recv {
	if (req.url == "/varnish_check") {
		if (std.file_exists("/etc/varnish/answer_probes")) {
			set req.url = "/backend_check";
			return (hash);
		} else {
			return (synth(404));
		}
	}
}

It runs for a while, you're happy and forget about it, until the fateful day when all your backends explode but you only realize it four hours later, in the middle of the night, because you were trusting Varnish to tell you about it and it didn't. All of this because of one thing: you forgot Varnish was a caching server, and it cached the /backend_check responses, leading you to believe you were safe.

But, before you start blaming yourself for such carelessness, to be fair:

  • Varnish is pretty great for exactly that: hiding dying backends.
  • Others before you have made that mistake.
  • You didn't actually make the mistake; this was just a narrative device to explain the issue!
  • The situation is easy to fix and is presented below.

The meh one

Let's almost fix our VCL in the most intuitive way and see where that takes us, shall we? We got burned last time because Varnish was caching, right? So we have to bypass the cache this time:

import std;

sub vcl_recv {
	if (req.url == "/varnish_check") {
		if (std.file_exists("/etc/varnish/answer_probes")) {
			set req.url = "/backend_check";
			return (pass);
		} else {
			return (synth(404));
		}
	}
}

And that will work... better. But only slightly, because you risk inconsistencies in your reports, and that's going to be annoying at the very least.

Let me explain: probes are a way to consolidate status over a period of time. Your backends may not reply OK every single time, but if enough probes succeed, the backend is considered healthy. Multiple backends, in turn, are a way to increase the opportunities to retrieve content: if one is sick, go to another. The above VCL negates all that by picking just one sample in the middle of this, meaning that even though all your backends are technically healthy, you may still fail that one probe.
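As a reminder, that consolidation is exactly what a backend probe definition expresses; a sketch, with example values:

probe backend_probe {
	.url = "/backend_check";
	.interval = 5s;
	.timeout = 1s;
	# healthy if at least 3 of the last 5 probes succeeded
	.window = 5;
	.threshold = 3;
}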

You end up in a situation where Varnish can fail a probe because of a single backend, even though, internally, it thinks (and reports) that the backends are fine. In practice, that can be super frustrating to debug.

The good one, too

Time to fix some VCL again, and truly fix it this time! We have finally learned what we really want: to align the probe response with Varnish's internal view of the backends' status. Basically, we just need to reply 200 if at least one of our backends (let's say we grouped them in a director named "dir") is healthy.

Turns out, vmod_std, once again, has a function for that:

import std;
import directors;

sub vcl_recv {
	if (req.url == "/varnish_check") {
		if (std.file_exists("/etc/varnish/answer_probes") && std.healthy(dir.backend())) {
			return (synth(200));
		} else {
			return (synth(404));
		}
	}
}

Now we'll only reply 200 if the flag file exists AND we have a backend that we can use. Start the conga line, we have a winner!
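For completeness, here's a sketch of the director setup the snippet assumes (backend names and addresses are made up):

import directors;

backend be1 { .host = "192.0.2.10"; .port = "8080"; }
backend be2 { .host = "192.0.2.11"; .port = "8080"; }

sub vcl_init {
	new dir = directors.round_robin();
	dir.add_backend(be1);
	dir.add_backend(be2);
}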

Wait, if there are two good ones, what's the best one?

As you can imagine, the answer here is highly dependent on context. Generally, and personally, I tend to prefer the first version:

import std;

sub vcl_recv {
	if (req.url == "/varnish_check") {
		if (std.file_exists("/etc/varnish/answer_probes")) {
			return (synth(200));
		} else {
			return (synth(404));
		}
	}
}

Systems should be cleanly layered, and one tier should only care about its immediate neighbors. By asking Varnish to punch holes in that model, things get muddy.

Additionally, if you use the std.healthy() version, it becomes very tempting to use it as your only metric for the backend pool, but remember: Varnish will then only tell you when ALL the backends are down, while you actually want to know as soon as the first one bites the dust (we'll cover this in an upcoming post).

Also, note that it doesn't have to be one or the other! For example, you can use the first version for monitoring and the second one for routing: just use a different probe URL for each and you are good to go! Heck, go even further: you can check on multiple directors, using a probe URL for each to report per-service uptime!
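A sketch of that last idea, with two hypothetical directors (api_dir and static_dir, assumed to be created in vcl_init):

import std;
import directors;

sub vcl_recv {
	# one probe URL per director, each reporting its own pool's health
	if (req.url == "/varnish_check/api") {
		if (std.healthy(api_dir.backend())) {
			return (synth(200));
		}
		return (synth(404));
	}
	if (req.url == "/varnish_check/static") {
		if (std.healthy(static_dir.backend())) {
			return (synth(200));
		}
		return (synth(404));
	}
}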

In conclusion

The main takeaway here is that you have to be a bit careful with what message Varnish should send back, and while there are a few mistakes you can make, things are generally simple because there are only two choices. However, remember that we only focused on Varnish and what it tells the rest of the system. It is equally important that said system make good use of the message! But that one is outside of Varnish's scope, and will be left as an exercise for the reader ;-)

As always, should you want to dig deeper into this subject (including discussion of said exercise!), do not hesitate to leave a comment, ping me on Twitter, or join IRC or the varnish-misc mailing list to chat!