June 25, 2024
11 min read time

Varnish and Observability (Part 1: The Basics)


Is Varnish fast? Of course it is! I know that, you know that, but I want my friends, and my friends’ friends to also know that. But of course, they aren’t just going to believe me, because, as one wise man said: “bragging rights aren’t given, they have to be demonstrably earned”.

This is where monitoring and observability come into play. We need them to collect data about how Varnish is doing, and send that information somewhere™ it can be aggregated, crunched, and correlated with various events on the platform. In short: we need to make it useful. How do we do that?

Well, I’m an engineer, so of course, my go-to answer will be: “it depends”. But because I really want you to brag about Varnish performing very well, I have something for you today.


Metrics, all of them

If you know Varnish at all, you know that we take great pride in exposing as much information as possible (and in the most efficient way possible), to the point that the only word that comes to (my) mind is “brutal”.

Let’s start with metrics: our tool of choice here is varnishstat. This deceptively simple tool will happily dump all the internal counters that Varnish constantly updates. And I literally mean ALL of them; we hide nothing, no sir!

For example, here’s what I get on my laptop, with a fairly vanilla configuration:

$ varnishstat -1
MGT.uptime                    261075         1.00 Management process uptime
MGT.child_start                    1         0.00 Child process started
MGT.child_exit                     0         0.00 Child process normal exit
MGT.child_stop                     0         0.00 Child process unexpected exit
MGT.child_died                     0         0.00 Child process died (signal)
MGT.child_dump                     0         0.00 Child process core dumped
MGT.child_panic                    0         0.00 Child process panic
MAIN.summs                        88         0.00 stat summ operations
MAIN.uptime                   261077         1.00 Child process uptime
MAIN.sess_conn                    10         0.00 Sessions accepted
MAIN.sess_fail                    96         0.00 Session accept failures
MAIN.sess_fail_econnaborted        0         0.00 Session accept failures: connection aborted
MAIN.sess_fail_eintr              96         0.00 Session accept failures: interrupted system call
MAIN.sess_fail_emfile              0         0.00 Session accept failures: too many open files
MAIN.sess_fail_ebadf               0         0.00 Session accept failures: bad file descriptor
MAIN.sess_fail_enomem              0         0.00 Session accept failures: not enough memory
MAIN.sess_fail_other               0         0.00 Session accept failures: other
MAIN.client_req_400                0         0.00 Client requests received, subject to 400 errors
MAIN.client_req_417                0         0.00 Client requests received, subject to 417 errors
MAIN.client_req                   10         0.00 Good client requests received
MAIN.esi_req                      20         0.00 ESI subrequests
MAIN.cache_hit                     0         0.00 Cache hits
MAIN.cache_hit_grace               0         0.00 Cache grace hits
MAIN.cache_hitpass                 0         0.00 Cache hits for pass.
MAIN.cache_hitmiss                18         0.00 Cache hits for miss.
MAIN.cache_miss                   30         0.00 Cache misses
<snipped>

The total output contains 356 counters (again, on my vanilla laptop setup), ranging from currently open connections and cache usage down to the number of times a specific mutex had to wait to get locked. It’s a treasure trove of information, and while you will probably never need ALL of it, what you do need is almost certainly in there.

varnishstat makes it easy to sift through a wealth of information. You can query individual counters, or glob entire categories. All of that is just one filter away on the command line. We’ve got you covered in terms of visualization options, too.
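
To make that concrete, here’s a quick sketch of the filtering option: -f takes a glob pattern and can be repeated, so you can pick individual counters or whole categories in one go:

$ # a single counter
$ varnishstat -1 -f MAIN.cache_hit
$ # every session-related counter in the MAIN category
$ varnishstat -1 -f 'MAIN.sess_*'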

We’ve already seen the -1 argument, which dumps the stats once and exits, but there’s also the default interactive view you get by running varnishstat with no arguments:

[screenshot: varnishstat’s interactive view]

But if you want to export the data, you’ll probably want varnishstat -j for some JSON goodness:

$ varnishstat -j
{
  "version": 1,
  "timestamp": "2024-05-17T16:54:58",
  "counters": {
    "MGT.uptime": {
      "description": "Management process uptime",
      "flag": "c",
      "format": "d",
      "value": 262315
    },
    "MGT.child_start": {
      "description": "Child process started",
      "flag": "c",
      "format": "i",
      "value": 1
    },
    "MGT.child_exit": {
      "description": "Child process normal exit",
      "flag": "c",
      "format": "i",
      "value": 0
    },
    <snipped>

Note: this is the output that is internally used by the great Varnish Prometheus Exporter.
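
And since everything nests under the counters object, the JSON is easy to slice with something like jq. A small sketch, assuming jq is installed and your instance has actually seen some traffic (otherwise the ratio below divides by zero):

$ # grab one value
$ varnishstat -j | jq '.counters."MAIN.cache_hit".value'
$ # hit ratio: hits / (hits + misses)
$ varnishstat -j | jq '.counters as $c |
    $c."MAIN.cache_hit".value / ($c."MAIN.cache_hit".value + $c."MAIN.cache_miss".value)'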

You’re feeling old school and want some XML? We have that too, and I promise I won’t judge you:

$ varnishstat -x
<?xml version="1.0"?>
<varnishstat timestamp="2024-05-17T16:56:28">
  <stat>
    <name>MGT.uptime</name>
    <value>262405</value>
    <flag>c</flag>
    <format>d</format>
    <description>Management process uptime</description>
  </stat>
  <stat>
    <name>MGT.child_start</name>
    <value>1</value>
    <flag>c</flag>
    <format>i</format>
    <description>Child process started</description>
  </stat>
  <snipped>
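
If you do go the XML route, the usual XML tooling applies. For example, a sketch using xmllint’s --xpath option (assuming libxml2’s xmllint is installed; the “-” tells it to read from stdin):

$ # extract a single counter value from the XML output
$ varnishstat -x | xmllint --xpath '//stat[name="MGT.uptime"]/value/text()' -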


What about logs?

Oh boy, I thought you’d never ask! Logs in Varnish are handled a tiny bit differently than in other applications, in the sense that Varnish itself doesn’t write log files. Instead, it just dumps the maximal set of data into memory and lets external loggers read that memory and format it the way YOU want and need it. In essence, Varnish is constantly logging everything in memory, and you can access that memory through a couple of tools.
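
One practical consequence of this design: the in-memory log is a circular buffer, so if you want to keep the raw data around, you ask a tool to write it out and replay it later. A minimal sketch using varnishlog (which we’ll meet properly in a second) with its -w and -r options:

$ # capture the raw log to a file (binary VSL format)
$ varnishlog -w /tmp/varnish.vsl
$ # later, replay it and format it exactly like live traffic
$ varnishlog -r /tmp/varnish.vsl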

First up is my favorite debugging tool of all time: varnishlog. If varnishstat was brutal, varnishlog is just bonkers. Have a look:

*   << Request  >> 2
-   Begin          req 1 rxreq
-   Timestamp      Start: 1556292791.607415 0.000000 0.000000
-   Timestamp      Req: 1556292791.607415 0.000000 0.000000
-   ReqStart       127.0.0.1 45212
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: 0:8080
-   ReqHeader      User-Agent: curl/7.64.1
-   ReqHeader      Accept: */*
-   ReqHeader      X-Forwarded-For: 127.0.0.1
-   VCL_call       RECV
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       MISS
-   VCL_return     fetch
-   Link           bereq 3 fetch
-   Timestamp      Fetch: 1556292791.612406 0.004991 0.004991
-   RespProtocol   HTTP/1.0
-   RespStatus     200
-   RespReason     OK
-   RespHeader     Server: SimpleHTTP/0.6 Python/3.7.3
-   RespHeader     Date: Fri, 26 Apr 2019 15:33:11 GMT
-   RespHeader     Content-type: text/html; charset=utf-8
-   RespHeader     Content-Length: 12771
-   RespProtocol   HTTP/1.1
-   RespHeader     X-Varnish: 2
-   RespHeader     Age: 0
-   RespHeader     Via: 1.1 varnish (Varnish/5.2)
-   VCL_call       DELIVER
-   VCL_return     deliver
-   Timestamp      Process: 1556292791.612452 0.005036 0.000046
-   RespHeader     Accept-Ranges: bytes
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1556292791.612531 0.005116 0.000079
-   ReqAcct        70 0 70 256 12771 13027
-   End

**  << BeReq    >> 3
--  Begin          bereq 2 fetch
--  Timestamp      Start: 1556292791.607599 0.000000 0.000000
--  BereqMethod    GET
--  BereqURL       /
--  BereqProtocol  HTTP/1.1
--  BereqHeader    Host: 0:8080
--  BereqHeader    User-Agent: curl/7.64.1
--  BereqHeader    Accept: */*
--  BereqHeader    X-Forwarded-For: 127.0.0.1
--  BereqHeader    Accept-Encoding: gzip
--  BereqHeader    X-Varnish: 3
--  VCL_call       BACKEND_FETCH
--  VCL_return     fetch
--  BackendOpen    33 boot.default 127.0.0.1 9999 127.0.0.1 33424
--  BackendStart   127.0.0.1 9999
--  Timestamp      Bereq: 1556292791.607716 0.000118 0.000118
--  Timestamp      Beresp: 1556292791.611821 0.004223 0.004105
--  BerespProtocol HTTP/1.0
--  BerespStatus   200
--  BerespReason   OK
--  BerespHeader   Server: SimpleHTTP/0.6 Python/3.7.3
--  BerespHeader   Date: Fri, 26 Apr 2019 15:33:11 GMT
--  BerespHeader   Content-type: text/html; charset=utf-8
--  BerespHeader   Content-Length: 12771
--  TTL            RFC 120 10 0 1556292792 1556292792 1556292791 0 0
--  VCL_call       BACKEND_RESPONSE
--  TTL            HFP 5 0 0 1556292792
--  VCL_return     pass
--  Storage        malloc Transient
--  ObjProtocol    HTTP/1.0
--  ObjStatus      200
--  ObjReason      OK
--  ObjHeader      Server: SimpleHTTP/0.6 Python/3.7.3
--  ObjHeader      Date: Fri, 26 Apr 2019 15:33:11 GMT
--  ObjHeader      Content-type: text/html; charset=utf-8
--  ObjHeader      Content-Length: 12771
--  Fetch_Body     3 length stream
--  BackendClose   33 boot.default
--  Timestamp      BerespBody: 1556292791.612237 0.004639 0.000416
--  Length         12771
--  BereqAcct      135 0 135 156 12771 12927
--  End

That’s for one (ONE!) client request that then goes to the origin. You get timing information for each step of the way, you get caching decisions, connection data, header manipulation, YOU.GET.EVERYTHING.
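
If the full firehose is too much, you can restrict the output to specific record tags; for example, just the URLs and the timing lines (-i takes a comma-separated list of tags):

$ varnishlog -i ReqURL,Timestamp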

You also get a powerful query system to filter slow requests, failed responses, transactions from a certain IP address, etc. Seriously, it’s insane, but it’s absolutely necessary when you need to find that one problematic request and debug it.
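
To give you a taste, here are a few sketches of that query language (see the vsl-query(7) man page for the full grammar):

$ # transactions that got a 5xx response
$ varnishlog -g request -q 'RespStatus >= 500'
$ # requests that took more than a second to deliver
$ varnishlog -q 'Timestamp:Resp[2] > 1.0'
$ # everything coming from one address (field 1 of ReqStart is the client IP)
$ varnishlog -q 'ReqStart[1] eq "127.0.0.1"'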

All that data miiiight be a bit too much if you just need access logs, though. Access logs should be short and sweet, with just the information you need for reports, billing, and other light but necessary post-processing. For this, we have varnishncsa, which will output lines like:

127.0.0.1 - - [26/Apr/2019:08:33:11 -0700] "GET http://0:8080/ HTTP/1.1" 200 12771 "-" "curl/7.64.1"

As the name suggests, it supports the NCSA log format and lets you tweak what goes into each line. Oh, and it supports the same excellent query system as varnishlog, in case you want to filter requests per domain name, for example. It’s not uncommon to run multiple varnishncsa daemons logging to different files, to cover multiple properties/sites independently.
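
As a sketch, here’s roughly what one of those per-site loggers could look like, combining a query with a custom format string (the hostname and log path are placeholders; %{Varnish:handling}x records hit/miss/pass, and varnishncsa(1) documents the rest):

$ varnishncsa -q 'ReqHeader:Host eq "www.example.com"' \
    -F '%h %l %u %t "%r" %s %b %{Varnish:handling}x %{Varnish:time_firstbyte}x' \
    -D -w /var/log/varnish/www.example.com.log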

Last but not least, there’s also JSON logging, which we announced a few weeks ago. It offers a middle ground between the two previous tools: you get structured data, but you don’t need to handpick all the headers you want to know about:

$ varnishlog-json -p
{
  "req": {
    "headers": {
      "host": "localhost:6081",
      "user-agent": "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
      "accept-language": "en-US,en;q=0.5",
<snipped>

And because it’s JSON, it’s highly processable: the format is ubiquitous and every language can parse it. And at the risk of repeating myself, this one too supports query filters (all these tools are built atop the same library, so those benefits are shared).
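
Assuming the structure shown above (request headers nested under req.headers, one JSON object per transaction), pulling a field out is a one-liner with something like jq; a hedged sketch:

$ # print the User-Agent of each logged transaction
$ varnishlog-json | jq -r '.req.headers."user-agent"'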


What to do with all that data?

This, my friends, is a question for another blog post; the next one, actually. Varnish follows the UNIX philosophy: it makes sure the data is easily accessible and exportable, and leaves the transport and use of that data to more specialized tools.

In part 2, we’ll explore integrations and see how we can plug all of this into a larger ecosystem in just a few steps. Stay tuned, and see you in a week or two!