February 3, 2016
7 min read time

Achieving and Retaining High Performance with Varnish

Varnish is fast - very fast. But you already know this, don't you? It's also, thanks to VCL, an incredibly versatile tool. Is that enough?

Of course, given the title of the post and the rhetorical tone of the question, the answer is "no". And yet, too many people confronted with a performance issue "just put Varnish in front of the server; it just works". Varnish, though, is like an afterburner: ready to give you a huge speed boost, but YOU still have to steer the ship.

Test

I know, I know, you are already aware that testing is essential, but do you actually do it? (No, opening the front page of your site in Firefox doesn't count.)

We already posted a piece about varnishtest, but it bears repeating: it is an awesome tool. One thing people often overlook is that you can use it to test more than just Varnish. For example, have a look at this:

varnishtest "Check Google"
client c1 -connect "google.com 80" {
    txreq
    rxresp
    # google.com returns a redirection code, not a 200 OK
    # the more you know!
    expect resp.status == 302
} -run
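
To try it, save the snippet to a file (say, check_google.vtc, a name picked just for this example) and hand it to varnishtest:

    varnishtest check_google.vtc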

It's a very simple example, but one with no Varnish involved, connecting to a real-world server, and with a rather understandable syntax. Now, you have no excuse! Go forth and test!

Keep your VCL legible, if not simple

The Varnish Configuration Language is extremely powerful, allowing you to route requests to the right backend, rewrite headers, or override caching policy, but all this comes at a price: by doing away with the usual key/value logic, VCL is actually a programming language.

And programming is hard to get right. All that power tempts people to go crazy: fixing a TTL value in VCL is often easier than fixing the backend, so we keep piling up exceptions in the VCL when we should take some time and fix the bug in the right place.

Sometimes, your VCL HAS TO be huge, and really, all that is needed is discipline and logic:

  • Indent your code. Seriously, do it.
  • Comments are supremely important, for you and for those who will follow.
  • Split your VCL into separate files if it grows too large.
  • Isolate backend definitions and ACLs in a file. That way, modifying the VCL for a test environment will be much easier.
  • Use functions (subs) with descriptive names, and hide the definitions away from the main VCL file, to make the logic of your configuration easy to see.
  • Import and include all your VMODs and VCL files at the top of the file.
  • Avoid crazy include chains; if possible, have one top file that includes all the others.
  • Use syntax highlighting (vim has a plugin for it, and for emacs the perl highlighting works pretty well).
It may seem obvious (good thing if it does), but debugging spaghetti code isn't something you want to do, ever, and certainly not under stress.
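
To make those points concrete, here is a minimal sketch of what a tidy top-level VCL can look like; the file names and the check_purge subroutine are hypothetical, chosen only to show the structure:

    vcl 4.0;

    # all imports and includes at the top
    import std;

    # backends and ACLs live in their own files, so swapping
    # them out for a test environment is a one-file change
    include "backends.vcl";
    include "acls.vcl";

    # descriptively named subs, defined in purging.vcl
    include "purging.vcl";

    sub vcl_recv {
        # the main file only shows the high-level logic
        call check_purge;
    }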

Tune and monitor

Varnish is fast, but it wouldn't do much without an operating system underneath it, and a good deal of the time, that OS is some sort of GNU/Linux. Let's remember that Linux is very customizable and runs on a dizzying number of machines, from the Raspberry Pi to phones to supercomputers, so it has to be tuned. Your distribution is probably targeted towards servers, but a server can be a lot of things, and here we are aiming for speed, so it's worth going the extra mile.

  • Tune your network stack to handle enough connections, and recycle ports faster if you need to.
  • Tune IRQ handling so you don't saturate your processors.
  • Tune your file systems and mount options.
  • Tune the VM subsystem to force dirty pages out early if you use file storage.
  • Put the shared memory log in a tmpfs.
  • And so forth...
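
As an illustration only, here is what a couple of those knobs can look like on Linux; the values are placeholders to show the mechanism, not recommendations, and the right numbers depend entirely on your hardware and traffic:

    # /etc/sysctl.d/varnish.conf -- illustrative values, tune for your workload
    net.ipv4.ip_local_port_range = 1024 65535    # more ephemeral ports to recycle
    net.core.somaxconn = 1024                    # deeper accept queue under bursts
    vm.dirty_background_ratio = 5                # start flushing dirty pages earlier

    # /etc/fstab -- keep the shared memory log off the disk
    # (adjust the path to your Varnish working directory)
    tmpfs /var/lib/varnish tmpfs defaults,size=150m 0 0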

These are a few of the things you can do to help Varnish deliver the best possible performance. Varnish takes pride in not working against the OS, and sometimes that means not working around it either, so it's up to you to ensure the OS works as intended and the machine runs as well as it can.

A few years ago, there was a discussion about whether servers are like cattle or pets, but really, either way, you should keep an eye on them, all the time. Get as much information as you can, but above all, know what to look for. Netflix had a nice post about performance analysis. Don't get paranoid, but learn to identify patterns, like a falling hit ratio, an increase in backend traffic, or unusual CPU usage, and ACT on them.

Again, tons of tools are here to help: varnishstat or varnish-agent for "basic" Varnish monitoring, and VCS for more specialized information, like aggregated data about the response times of a specific backend, or the hit ratio of a certain class of URLs.
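
For instance, a simple starting point for watching the hit ratio is the pair of counters behind it:

    # one-shot dump of the counters behind the hit ratio
    varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss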

Speaking of hit ratio...

That may sound like an obsession among Varnish users, but there is a sound reason: Varnish is there to handle the load directed at your servers. The higher the hit ratio, the lower the load on the backend. So, keep that ratio up!
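
One cheap way to keep it up is making sure equivalent requests actually map to the same cache object. A small sketch using the std VMOD; whether these normalizations are safe depends on your application:

    import std;

    sub vcl_recv {
        # "Example.com" and "example.com" should share a cache entry
        set req.http.host = std.tolower(req.http.host);

        # "?a=1&b=2" and "?b=2&a=1" should share one too
        set req.url = std.querysort(req.url);
    }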

Beyond that, there are multiple strategies for managing hit ratio, from using a layer-7 load-balancer in front of the Varnish servers (which can itself be a Varnish server) to using Varnish High Availability. But remember, even with perfectly tuned and monitored servers serving with a high hit ratio...

...things will go wrong

Sometimes, the workload will be too much. There, I said it. It will happen: you'll get too many requests, the backend will fail, a contractor in the datacenter will unplug your server to plug in theirs. It happens, and we have to accept it.

But that doesn't mean we can't do anything about it. Actually, there's a lot we can do. Varnish, as the first line of defense, can do pretty much anything, thanks again to VCL. Here are some examples.

Probing

Backend probes, for example, will detect failures and re-route requests to a healthy backend. For this to work, tuning is important: a probe must reflect normal and stressed load, and be allowed to fail if stuff goes wrong. Let's look at a (probably) bad probe:

probe foo {
    .url = "/";
    .timeout = 10s;
    .interval = 12s;
    .window = 8;
    .threshold = 2;
}

This will request "/" from the backend every 12 seconds, timing out after 10 seconds, and will declare the backend healthy as long as at least 2 of the last 8 probes were good. This means that if the machine drops dead, Varnish can take more than a minute (7 bad probes at 12s each, until fewer than 2 good ones remain in the window) to realize it. In addition, 10 seconds for a timeout seems awfully long. If you expect the backend to take this long, it may be time to fix the backend.

Finally, "/" is often not a good probe target, as it may not represent meaningful work and may still respond even when the machine is failing.

Ultimately, probes should react fast AND be trustworthy, but that is a delicate balance to find, and here again, monitoring can help a great deal.
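
For contrast, here is a sketch of a probe that leans toward fast reaction; the "/health" endpoint is hypothetical (ideally one doing a small but representative amount of work), and the numbers are only a starting point:

    probe fast_probe {
        .url = "/health";
        .timeout = 2s;   # a healthy backend should answer quickly
        .interval = 5s;
        .window = 5;
        .threshold = 3;  # 3 of the last 5 probes must succeed
    }

With values like these, a dead backend is spotted after three failed probes, roughly 15 seconds.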

Grace

Grace was already present in Varnish 3, but in 4.0 it has really become awesome. Gracing an object allows it to be served even though its TTL has expired, and this, coupled with VCL's ability to inspect the health of a backend, is very powerful.

We can write this:

import std;

sub vcl_hit {
    if (obj.ttl >= 0s) {
        # no question, object is still valid, we deliver
        return (deliver);
    } else if (!std.healthy(req.backend_hint) &&
               (obj.ttl + obj.grace > 0s)) {
        # object is invalid, but graced,
        # and the backend is down, deliver what we have
        return (deliver);
    } else {
        # none of the above, it's a miss, go fetch it
        return (fetch);
    }
}

This, combined with probes, allows your site to keep running while your backend recovers, at the price of serving stale content.
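
For this to work, objects must actually be kept around past their TTL, which you control in vcl_backend_response. A minimal sketch, with an arbitrary six-hour window:

    sub vcl_backend_response {
        # keep objects 6 hours past their TTL so they can be
        # served stale if the backend goes down
        set beresp.grace = 6h;
    }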

Throttling

The internet can be a crazy place (it's probably why it's called the "World Wild Web"), and if you want good QoS for your users, you may have to stop serving abusers. Well, the ones causing too much traffic, at least.

It turns out that there's a VMOD for that! It's called vmod-throttle and works like this in your VCL:

if (throttle.is_allowed(client.ip + req.url, "2req/s, 1000req/d") > 0s) {
    # VCL 4 syntax; in VCL 3 this was: error 429 "Calm down";
    return (synth(429, "Calm down"));
}

The function takes two strings as arguments: the first one is an identifier and the second is a list of requests per period. Here, we are limiting one IP on one URL to 2 requests per second or 1000 requests per day, whichever limit is hit first.

As the identifier, we could also use the User-Agent header, a cookie, or the country code of the user, allowing you to target very specific aspects of a request.

Wrapping it up

All of this is probably not new to you, but it's worth repeating: even if Varnish does the heavy lifting, you still have to pilot it. Fortunately, the Varnish ecosystem provides us with all the help we need to integrate and operate it at top speed, conveniently and reliably.

Are you ready to achieve and retain your optimal web performance? Why not register for a Varnish Plus trial? 

Try Varnish Plus

Photo (c) 2015 Ralf Κλενγελ used under Creative Commons license.