September 11, 2024
7 min read time

Varnish and Observability (Part 2: Grafana and Friends)

In our previous post, Varnish and Observability (Part 1: The Basics), we explored the core Varnish tools dedicated to monitoring and caught a glimpse of just how much information we can extract from our setup(s). But we also left a question unanswered: then what?

Having a lot of information is great, but having it isn’t enough: you need to store it, access it, and make sense of it. This goes beyond what Varnish itself does, and there are thousands of different approaches to it, so today I’d like to present a tiny thing we just released, along with some thoughts about it.

A Grafana reference architecture

Let’s go: we’ve built a small Docker Compose setup to demonstrate how Varnish can work with some Grafana tools. We’ll keep this blog post relatively light on technical details to keep it accessible, but don’t worry! If you want to dive deep and tinker, we have an extensive tutorial on the developer portal based on that minimal (but functional) reference architecture. With these two, you can get started in minutes!

As a high-level overview, this will provide you with the full monty:

  • A running Varnish image
  • An origin serving default content (but you can override it)
  • A simple load generator that you can tinker with
  • Prometheus
  • Loki
  • Metrics and logs exporters
  • Grafana

[Diagram: the Grafana reference architecture] The full monty, I tell ye!
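If you’re curious what that translates to in compose terms, here’s a rough sketch. The service names mirror the container names in the output below, but the images, build contexts and ports are assumptions for illustration, not the ones used in the actual reference architecture:

# Illustrative sketch only: service names match the container names shown in
# the `docker compose up` output below; images, build contexts and ports are
# assumptions, not the actual reference architecture's values.
services:
  varnish:
    image: varnish              # the system under observation
    depends_on: [origin]
  origin:
    image: nginx                # serves default content; override it with your own
  load_generator:
    build: ./load_generator     # simple traffic source you can tinker with
  exporter:
    build: ./exporter           # exposes Varnish metrics for Prometheus to scrape
  promtail:
    image: grafana/promtail     # ships Varnish logs to Loki
  prometheus:
    image: prom/prometheus      # metrics storage
  loki:
    image: grafana/loki         # log storage
  grafana:
    image: grafana/grafana      # the UI
    ports:
      - "3000:3000"             # http://localhost:3000, as used below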

And it’s easy to use too:

$ docker compose up
[+] Running 8/0
 ✔ Container grafana-monitoring-promtail-1        Created  0.0s
 ✔ Container grafana-monitoring-prometheus-1      Created  0.0s
 ✔ Container grafana-monitoring-grafana-1         Created  0.0s
 ✔ Container grafana-monitoring-loki-1            Created  0.0s
 ✔ Container grafana-monitoring-origin-1          Created  0.0s
 ✔ Container grafana-monitoring-varnish-1         Created  0.0s
 ✔ Container grafana-monitoring-load_generator-1  Created  0.0s
 ✔ Container grafana-monitoring-exporter-1        Created  0.0s
Attaching to exporter-1, grafana-1, load_generator-1, loki-1, origin-1, prometheus-1, promtail-1, varnish-1
promtail-1   | level=info ts=2024-07-03T10:14:21.955044284Z caller=promtail.go:133 msg="Reloading configuration file" md5sum=6e599e38cb8ade2354b745a605524aa9
promtail-1   | level=info ts=2024-07-03T10:14:21.957054448Z caller=server.go:334 http=[::]:80 grpc=[::]:9095 msg="server listening on addresses"
promtail-1   | level=info ts=2024-07-03T10:14:21.958268152Z caller=main.go:174 msg="Starting Promtail" version="(version=2.8.7, branch=HEAD, revision=1dfdc432c)"
promtail-1   | level=warn ts=2024-07-03T10:14:21.958525379Z caller=promtail.go:265 msg="enable watchConfig"
loki-1       | level=warn ts=2024-07-03T10:14:21.977327096Z caller=loki.go:288 msg="global timeout not configured, using default engine timeout (\"5m0s\"). This behavior will change in the next major to always use the default global timeout (\"5m\")."
loki-1       | level=info ts=2024-07-03T10:14:21.981822459Z caller=main.go:108 msg="Starting Loki" version="(version=2.9.0, branch=HEAD, revision=2feb64f69)"
loki-1       | level=info ts=2024-07-03T10:14:21.98476341Z caller=server.go:322 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
loki-1       | level=warn ts=2024-07-03T10:14:21.989257745Z caller=cache.go:127 msg="fifocache config is deprecated. use embedded-cache instead"
loki-1       | level=warn ts=2024-07-03T10:14:21.990085316Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - chunksembedded-cache"
...

Now, if you head over to http://localhost:3000 in your browser and log in with admin/password, you should see something like:

[Screenshot: the Grafana dashboard]
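By the way, admin/password is just the provisioned default. If you’d rather not keep it, Grafana reads its admin credentials from its standard GF_SECURITY_ADMIN_* environment variables, so a small compose override is enough. The grafana service name matches the compose output above; the password value is, of course, yours to pick:

# docker-compose.override.yml -- picked up automatically by `docker compose up`.
# Overrides Grafana's admin credentials via its standard environment variables.
services:
  grafana:
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=pick-something-better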

That seems like a lot of moving parts

It is, but at the same time each component does one pretty specific thing, and thanks to Docker Compose we can isolate each of them in its own container for maximum clarity. Also, because we wanted this to be a reference architecture, we didn’t want to skimp on details that we feel are important to get the big picture.

In truth, though, the logic of the setup is fairly simple:

  • There’s the system to observe (Varnish), and the bare minimum of infrastructure around it (an origin and a client)
  • There’s storage for both metrics (Prometheus) and logs (Loki)
  • There’s transport to get the data into storage (those are our exporters); a sketch of that wiring follows this list
  • Finally, there’s Grafana as the UI
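To make that transport-to-storage split concrete, here’s roughly what the metrics side looks like: Prometheus periodically scrapes the exporter, which translates Varnish counters into the Prometheus exposition format. The target name reuses the exporter service from the compose setup, but the port (9131 is a common default for Varnish exporters) and the scrape interval are assumptions, not the reference architecture’s actual values:

# prometheus.yml (sketch): pull Varnish metrics from the exporter service.
# The target port and scrape interval are assumptions.
scrape_configs:
  - job_name: varnish
    scrape_interval: 15s
    static_configs:
      - targets: ["exporter:9131"]

On the logs side, Promtail plays the symmetric role: it tails the log output and pushes it to Loki’s HTTP endpoint (the one listening on port 3100 in the startup output above) via the clients section of its configuration.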

By now I’m sure you’ve heard the term “ELK”, an acronym for the same kind of stack, with the same kind of split: Elasticsearch for the storage, Logstash for the transport, and Kibana for the UI.

To me, that’s the exciting part: the system is built from bricks, and sure, you need to make sure all the pieces fit, but it’s incredibly modular. For example, Grafana isn’t tied to Prometheus and Loki; it can present data from Elasticsearch and Graphite. You could also use Fluent Bit as the transport for both Loki and Prometheus.
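You can see that modularity in how Grafana discovers its backends: each one is just a provisioned data source entry. A sketch of such a provisioning file, reusing the service names from this setup (the Prometheus port below is the usual 9090 default, which is an assumption here), might look like:

# grafana/provisioning/datasources/datasources.yml (sketch)
# Each backend is an interchangeable data source: swapping Prometheus for
# Graphite, or Loki for Elasticsearch, means editing entries here rather
# than rebuilding the whole stack.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumed default Prometheus port
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100         # port taken from the Loki startup log above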

It can be overwhelming at first, but as each element is fairly contained, it’s easy to grasp each one individually and the whole setup can evolve much more gradually than with a big monolith.

Nothing exists in a vacuum

Creating pedagogical material has always been a fascinating exercise to me, for a couple of reasons. First, teaching helps you learn and tests your knowledge in ways even real-life experience doesn’t: you have to intimately understand a topic to explain it correctly. Second, and that’s the point of this section: you have to balance the needs of your audience and decide what matters most. Is it understanding the topic in depth, or being proficient with it? The two are linked, but they are not the same, and teaching often boils down to knowing how much you can leave unsaid.

That’s a very important point for reference architectures, and this one in particular. To provide an environment that is easy to understand and to play with, we resisted the urge to explain, comment and dissect everything. Instead, we rely on the curiosity of the reader, as you can see in the tutorial, where we constantly link to the relevant documentation.

In the case of observability, it’s important to understand that monitoring Varnish alone isn’t enough, even though the reference architecture deliberately monitors ONLY Varnish, for the sake of teachability.

What about observability platforms as a service?

You know, like New Relic, Datadog or Dynatrace? It turns out that Varnish has great integrations with those too (click the links; each one is a tutorial for that platform)! The nice thing about those platforms is the ease of use: essentially, the only extra component you need to care about is a reporter agent.

That agent, once told to monitor Varnish, will happily push data to the service and you will be able to set up dashboards and alerts within minutes. It’s almost too easy! I’m joking of course, but in that light you can understand why I presented Grafana first: as a teaching tool, those platforms aren’t that “interesting” since a lot is hidden away (which is a feature, not a bug).
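To give you an idea of how little there is to it, here is roughly what pointing such an agent at Varnish looks like, using a Datadog-style check configuration as the example. Treat the exact keys as approximate and from memory; the linked tutorials are the authoritative reference for each platform:

# varnish.d/conf.yaml (Datadog-style sketch, keys approximate -- see the
# platform tutorial for the exact, current format). The agent shells out to
# the Varnish CLI tools and reports the counters to the hosted service.
init_config:

instances:
  - varnishstat: /usr/bin/varnishstat
    varnishadm: /usr/bin/varnishadm
    secretfile: /etc/varnish/secret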

The power of friendship (and keeping an eye on your friends)

As much as I emphasized Varnish’s complete transparency (there’s a Full Monty joke in here, somewhere), you can’t ignore the rest of your infrastructure and you should watch it too, like a hawk.

The great value of monitoring and observability comes from data correlation. Knowing that your hit ratio is decreasing, or that traffic on Varnish is increasing, is nice, but the true power comes from looking at your whole system and seeing how all its components react to various events. So grab all the data you can, from all the systems you can, and expose it: patterns will emerge and everything will run more smoothly.
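In Prometheus terms, “grab all the data you can” often just means adding scrape jobs next to the Varnish one. The extra targets below (a node_exporter on each host and some exporter on the origin) are hypothetical additions for illustration, not part of the reference architecture:

# prometheus.yml (sketch): correlate Varnish with the rest of the stack by
# scraping more than just the Varnish exporter. All targets except the first
# are hypothetical examples.
scrape_configs:
  - job_name: varnish
    static_configs:
      - targets: ["exporter:9131"]                          # port assumed, as before
  - job_name: node
    static_configs:
      - targets: ["varnish-host:9100", "origin-host:9100"]  # node_exporter default port
  - job_name: origin
    static_configs:
      - targets: ["origin:8080"]                            # whatever your origin exposes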

That’s a lot of words to say that observability is a pretty long journey to embark on, but it’s also a very valuable one, and hopefully, we at Varnish can make it easy to start!

Need more data?

We’ve seen how Varnish exposes both metrics and logs, and we have tutorials on how to leverage them via a handful of tools. There’s one last question we need to tackle, an almost unimaginable one: “What if you need more metrics and logging capabilities?”

Most other servers would just tell you that you are being greedy, but not Varnish, because it has multiple answers to this question. But those will have to wait until part 3 of this series.