June 18, 2020

The best way to completely purge a Varnish cache



It's cargo-cult fighting time! Today, we are going to look at a ban expression that you probably have used, and maybe even have recommended (gasp!) to your fellow Varnish users:

req.url ~ /

We'll discuss why we use it, why it's good but mostly bad, and how to fix it. Hopefully, along the way, we'll shed some light on some Varnish internals that you can use in other situations.

Big red button

Cache invalidation is usually done in a pretty targeted manner, with the system purging only one object, or an object type, or a dependency tree. Purging more than needed means spending resources re-fetching objects that were totally fine, so it's usually frowned upon.

But sometimes, you need a bigger hammer, like, a really big one that is going to wipe the whole cache. It may be because a targeted approach is too complex, or because something went exceptionally wrong and you need a hard reset, or worse, because you had no invalidation strategy in the first place (if so, talk to us, we can help!).

For the cases where you need that big red nuclear button, people usually reach for one of two solutions. Let's look at both.

Restarting Varnish

This is acceptable in a test environment, but, without mincing words, it's a pretty terrible idea in production.

There are two reasons for this. First, if you are using MSE (the Massive Storage Engine), it doesn't work, since MSE is designed precisely to NOT lose the cache in case of a restart or reboot. Second, a restart actually stops the process. The service interruption only lasts a split second since Varnish is super fast to come back up, but all the in-flight requests are dropped, which is less than ideal.

Banning EVERYTHING!

The second solution is to "ban" all the objects in cache, which is actually a good idea!

Banning is an invalidation technique that can invalidate multiple objects at once. You can trigger it from VCL (this needs some custom code, but then you can ban with a simple HTTP request) or manually from varnishadm (useful when hell breaks loose and you didn't prepare for it).

A ban is simply an expression that Varnish runs against cached objects; if it matches, the tested object is removed. Two important points about this:

  1. The ban effect is immediate: once the ban is active, no banned object will make it out of the cache.
  2. Only objects older than the ban are affected by it. So a ban only impacts the objects currently in the cache, not the ones that will enter it later.
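The "custom code" mentioned above can be fairly small. Here is a minimal sketch, assuming Varnish 6.x, where ban() is a built-in VCL function (recent releases deprecate it in favor of std.ban()). The BAN method, the x-ban-expression header, the banners ACL and the backend address are all made up for the example:

```vcl
vcl 4.1;

# Placeholder backend (assumption) so the file compiles on its own.
backend default {
  .host = "127.0.0.1";
  .port = "8080";
}

# Hypothetical ACL: only these clients may add bans.
acl banners {
  "localhost";
  "192.168.0.0"/24;
}

sub vcl_recv {
  # "BAN" is a made-up HTTP method used purely as a trigger.
  if (req.method == "BAN") {
    if (client.ip !~ banners) {
      return (synth(403, "Forbidden"));
    }
    # Hypothetical header carrying the raw ban expression.
    if (!req.http.x-ban-expression) {
      return (synth(400, "Missing x-ban-expression header"));
    }
    # Add the expression to the ban list.
    ban(req.http.x-ban-expression);
    return (synth(200, "Ban added"));
  }
}
```

With this in place, an allowed client sends a BAN request whose x-ban-expression header holds the expression, for instance req.url ~ /. Accepting raw expressions over HTTP is flexible but powerful, hence the ACL.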

The idea, then, is to use an expression that covers all the objects, and this is where "req.url ~ /" comes in. To be fair, "req.url ~ /" is pretty good at one thing: it's easy to understand. It matches any request whose URL matches the regex "/", and since the path of a resource must start with "/", we are fairly sure that this will apply to all the requests.

But we can do better!

The regex issue

I'm going to sound like a broken record, but: you don't need that regex. For something as simple as our current need, we can go with:

req.url != ''

meaning: any request with a non-empty URL. As the URL can't be empty, this covers everything.

Why bother? Because evaluating a regular expression is relatively expensive. Don't get me wrong, regexes are super useful, but they have a cost, and in our case the expression needs to run through the whole cache, possibly testing millions of objects, so we want to make it easy on ourselves (and on our CPUs).

That's a nice boost, but the biggest one is yet to come.

Look out for the lucky lurker

To explain this, we need to go a bit deeper into how the banning process works.

Bans are, as we have seen, simple expressions to filter objects. As soon as they are activated, they are added to the ban list and start working via two mechanisms: lookup-time eviction and the ban lurker.

When testing an object for ban, Varnish goes through each ban in the ban list and:

  • makes sure the ban creation happened after the object entered the cache
  • if so, tests that the ban expression matches the object properties
  • if so, it invalidates the object

This work is sequential, so we want to make sure expressions are evaluated quickly, as explained above, but we also want to make the ban list as short as possible by letting bans do their work and disappear as soon as possible.

Very simply, a ban stays around until there's no more work for it, i.e. until all the objects in the cache that it hasn't examined yet are younger than its creation date. Eventually, all old objects will expire and Varnish will retire the ban, but there's a faster route too: once a ban has visited all the objects older than itself, it can be retired.

Note: you can also look at the ban_cutoff parameter to forcefully retire bans, if collateral eviction is acceptable for your setup.

Now that this has been covered, let's see when the ban list is used.

Lookup-time eviction

To make sure that no banned object makes it out, Varnish essentially checks the ban list whenever it looks up an object in cache. Any potential hit is checked against the bans and possibly evicted, forcing Varnish to look further for a hit. This takes care of the ban-immediacy requirement, but it also means that eviction is driven by traffic: without traffic, there is no lookup, no potential hit, and therefore no eviction.

Ban lurker

Behind this slightly creepy name hides a background thread that walks through the whole cache and tests objects in hopes of evicting them. Simply put, it's our ticket to removing objects even without traffic.

But... there's a big caveat: because it runs in the background, it's not linked to any request, so it cannot evaluate expressions using "req.*" elements.

And that is the big problem with "req.url ~ /", or even "req.url != ''": they don't leverage the ban lurker, so you may have to wait a long time before the objects, and the ban itself, are actually gone.

To use the lurker goodness, we need to use obj.* variables, which describe the actual object being stored rather than how we got to it. This is what we call "ban-lurker friendly bans", which is a bit of a mouthful, but eh, not much we can do about it now!
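A common way to make URL-based bans lurker friendly is to copy the URL onto the object at fetch time, then ban on that object header instead of req.url. A minimal sketch, assuming Varnish 6.x's built-in ban(); the x-url header name, the BAN trigger, the banners ACL and the backend address are assumptions:

```vcl
vcl 4.1;

# Placeholder backend (assumption) so the file compiles on its own.
backend default {
  .host = "127.0.0.1";
  .port = "8080";
}

# Hypothetical ACL of clients allowed to ban.
acl banners {
  "localhost";
}

sub vcl_recv {
  # "BAN" is a made-up HTTP method used purely as a trigger.
  if (req.method == "BAN") {
    if (client.ip !~ banners) {
      return (synth(403, "Forbidden"));
    }
    # obj.http.x-url is stored on the object itself, so the ban lurker
    # can evaluate this expression without any request in flight.
    ban("obj.http.x-url == " + req.url);
    return (synth(200, "Ban added"));
  }
}

sub vcl_backend_response {
  # Copy the URL onto the cached object at fetch time.
  set beresp.http.x-url = bereq.url;
}

sub vcl_deliver {
  # The helper header only exists for bans; don't leak it to clients.
  unset resp.http.x-url;
}
```

The price is one extra header per object; in exchange, the lurker can clear matching objects even if nobody ever requests them again.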

The solution

We could build on the non-empty URL check presented earlier, but there's something simpler in the same vein: checking that a field doesn't hold an impossible value, which catches every object.

We need an object field that we are sure exists, like obj.status, and since a status can never be 0,

obj.status != 0

will do nicely and match all the objects!
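If you'd rather not accept arbitrary expressions over HTTP, the big red button can also be hardcoded in VCL. A minimal sketch, assuming Varnish 6.x's built-in ban(); the BAN method, the /ban-everything path, the banners ACL and the backend address are hypothetical:

```vcl
vcl 4.1;

# Placeholder backend (assumption) so the file compiles on its own.
backend default {
  .host = "127.0.0.1";
  .port = "8080";
}

# Hypothetical ACL of clients allowed to press the button.
acl banners {
  "localhost";
}

sub vcl_recv {
  # Hypothetical trigger: a BAN request on /ban-everything.
  if (req.method == "BAN" && req.url == "/ban-everything") {
    if (client.ip !~ banners) {
      return (synth(403, "Forbidden"));
    }
    # No cached object has a status of 0, so this matches everything,
    # and since it only uses obj.*, the ban lurker can process it.
    ban("obj.status != 0");
    return (synth(200, "Full ban added"));
  }
}
```

Once the ban is added, varnishadm ban.list shows it until the lurker has walked the cache and retired it.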

As the Hulk says: "I see this as an absolute win!" We get faster checks, we don't lock the lurker out, and the expression is as simple as the original one.

What about ykey?

I'm glad you asked, fictional but astute reader! ykey is another mass invalidation tool, with two major differences when compared with bans:

  • it's tag-based: you do all the classification at cache-insertion time, which simplifies and greatly speeds up the purging process.
  • the invalidation is immediate too, but also synchronous, so memory is reclaimed right away.

The only caveat is that there's no way to tell ykey to purge everything. Instead, we tag everything with a common key:

import ykey;

sub vcl_backend_response {
  # tag every cached object with the shared key "all"
  ykey.add_key("all");
}

then we can purge everything tagged with "all":

ykey.purge("all");
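The ykey.purge() call has to live somewhere in client-side VCL, usually behind some access control. A minimal sketch, assuming Varnish Enterprise with vmod_ykey; the PURGE trigger, the /purge-all path, the purgers ACL and the backend address are made up, and we assume ykey.purge() returns the number of purged objects:

```vcl
vcl 4.1;

import ykey;

# Placeholder backend (assumption) so the file compiles on its own.
backend default {
  .host = "127.0.0.1";
  .port = "8080";
}

# Hypothetical ACL of clients allowed to purge.
acl purgers {
  "localhost";
}

sub vcl_recv {
  # Hypothetical trigger: a PURGE request on /purge-all.
  if (req.method == "PURGE" && req.url == "/purge-all") {
    if (client.ip !~ purgers) {
      return (synth(403, "Forbidden"));
    }
    # Synchronously remove every object tagged "all"; the return value
    # is assumed to be the count of purged objects.
    set req.http.n-gone = ykey.purge("all");
    return (synth(200, "Invalidated " + req.http.n-gone + " objects"));
  }
}

sub vcl_backend_response {
  # Same tagging as above: every object gets the "all" key.
  ykey.add_key("all");
}
```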

And that's it. Of course, you can use it more precisely, to tag all images, or all resources from a certain user, but I'm already digressing, so let's keep that for a later post.

Wrapping it up

Banning is sometimes a bit overlooked, mainly because things sort of work by copy-pasting recipes from the internet, but it's always good to know a little bit more about how things work internally before making a conscious implementation decision.

Admittedly, this article was little more than an excuse to explore how bans work, and I tried to keep it fairly short so the trickery wasn't too obvious. That means some details have been left out, so if you want to read more about banning and cache invalidation in general, please have a look at this invalidation tutorial over on the docs site.

Until next time!