May 29, 2024
8 min read time

The annoying (but sane) impossibility of transparent caching


Have a seat, we need to have a serious discussion. Don’t worry, you’re not in trouble, but we’ve got to talk. Well, actually, we need to have the talk. You know, the one you never want to have and just hope they’ll figure it out by themselves? Yeah, that talk.

Today, rather than explaining all the wonderful things Varnish can do, we have to explore one big thing it cannot: transparent caching. Not because it’s bad or lacking features, but because, well, “that’s just not how things work”. As a result, we won’t be talking about Varnish much, but rather about network concepts in general. But I’ll make it educational AND fun, I promise!

With that said, let’s embark on a journey to understand how things working correctly is exactly what prevents us from achieving our dreams.

 

Transparent caching: why and why not?

Alright, let’s agree on the terms here: “transparent caching” in this article will designate a combination of abilities:

  • Answering HTTP(S) requests from unspecified users
  • Being able to read individual requests and responses for future reuse
  • Fetching the responses to these requests from one or more HTTP(S) origins
  • Doing all that without the users having to opt-in, i.e., without them having to change or configure anything

And this is probably where you start seeing why transparent caching would be catastrophic: if one were able to do it, they could spy on and manipulate any HTTP traffic without the “users” (read “victims”) even being aware of what is happening. We’re talking about any HTTP communication: to your bank, to your doctor’s office, to YouTube, to TikTok, etc.

If you’re thinking “well, not me, I have a VPN”, do read on ;-)

I think we can agree that the possibility of transparent caching would be pretty bad, so now we can dig a bit into why it doesn’t work, and whether there are any compromises to be made.

 

DNS a.k.a. the “network phone book”

Without going into the boring details, there’s one thing we need to know: your computer doesn’t actually connect to “google.com” or “facebook.com” (domain names); instead it uses IP addresses like “123.45.25.210” or “a1:26:cc::56:f4”. To translate from one to the other, computers use the Domain Name System (DNS), allowing humans to retain simple names while still letting computers find ever-changing network addresses.
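
To see that translation in action, you can ask DNS yourself. Here’s a minimal sketch using the standard dig tool (example.com is just an illustration; the addresses you get back depend on your resolver):

# ask your configured resolver for the IPv4 address(es) behind a name
dig +short example.com A
# and for the IPv6 ones
dig +short example.com AAAA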

If you want to transparently cache traffic to, say, example.com, that means you need to trick users into connecting to your server’s IP address. That’s not really easy, and it’s becoming even harder. In practice, you need to sit somewhere between the user and the DNS server, see the client’s requests for name resolution, and simply reply faster than the legitimate server.

That’s doable in some cases, but more and more browsers are moving to DNS over HTTPS (DoH) to prevent this sort of shenanigans, and this causes the same kind of headaches that we’ll see with TLS below.
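
To give you an idea of what DoH looks like from the client side, here’s a hedged sketch using curl against Cloudflare’s public DoH endpoint (any DoH-capable resolver would do). Since the resolution itself travels over HTTPS, an on-path snooper can no longer see, or answer, the question:

# a DNS query wrapped in HTTPS, using the JSON flavor of DoH
curl -s -H 'accept: application/dns-json' \
  'https://cloudflare-dns.com/dns-query?name=example.com&type=A'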

Corporate systems and VPNs are a good example of DNS trickery. Most of the time, they can configure the users’ computers to use their own DNS server, so they can steer traffic wherever they want. It’s one step towards transparent caching, but we’ll see that it’s not enough. Also, one may argue against the transparency of the process: even though the final user may not know about the trickery, the IT department had to explicitly change the machine’s configuration.
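
As a rough illustration (the resolver address below is hypothetical), pointing a machine at a resolver you control is all it takes to decide what a name resolves to for that machine:

# ask a specific resolver (say, a corporate or VPN one) instead of the default;
# whatever 10.0.0.53 answers is where the client will connect
dig +short example.com @10.0.0.53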

TL;DR: if you don’t have DNS control, the users will connect to the real site, not yours.

 

TLS annoyance part 1: the encryption

You know what, all this is pretty annoying, so let’s try something simpler and focus on the second item on our list: can we just spy on traffic? Like, that should be easy, right?

It was, truly, with HTTP. But not anymore: now everything is HTTPS, with that S standing for “Secure (through TLS and asymmetric encryption)”. Annoyingly, that means that if we aren’t part of the negotiation between the client and the server, we won’t be privy to the conversation.

By that, I mean that all we’ll see is a stream of garbled bits, with no way to know what’s a request header or a body, or a flow control message (for HTTP/2). And if you want to cache HTTP content, you absolutely need to discern HTTP elements in the stream.

The obvious thing to explore here is HTTP and HTTPS forward proxies. Typically, users opt into using them to access the internet (so it’s not transparent), and in the case of HTTP we can cache, because there’s no encryption, and Varnish can actually do that very well. However, the crushing majority of traffic is now HTTPS, and the best an HTTPS proxy can do is the following (illustrated with curl right after the list):

  • Accept the client connection via HTTP
  • Read the request headers to see what the actual destination should be
  • Open a connection to that site
  • Just pass garbled bits back-and-forth, without understanding what they mean
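
You can watch that tunneling happen with curl and any HTTP proxy you have access to (the proxy address below is hypothetical). The proxy only ever sees the CONNECT request with a hostname and port; everything after the tunnel is established is TLS traffic between the client and example.com:

# -x points curl at the proxy, -v shows the CONNECT request used to open the tunnel
curl -v -x http://proxy.internal:3128 https://example.com/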

TL;DR: without being part of the TLS conversation, the HTTP bits are all scrambled, making caching impossible.

 

TLS annoyance part 2: the trust

Ok, but what if we are part of the TLS conversation? Again, TLS is here to foil our plans, preventing us from sneaking our way into discussions we have no part in.

Let’s say that we managed to steer a user to our server, and they try to open a secure connection, thinking they are talking to example.com. The gig is unfortunately going to be up before long, because our server is going to look like you did when you tried to forge your mom’s handwriting to get out of PE.

See, TLS revolves around certificates, and in short, the server is going to present its certificate, which conveys three main things:

  • The domains it serves
  • Some cryptographic bits needed to encrypt the future discussion
  • A chain of trust

And it’s that last point that bites us: essentially, it’s a verifiable statement that says something like “I’m trustworthy because this authority’s certificate trusted me, and this one was trusted by this even trustworthier certificate, and THIS one is trusted by …”. That’s why it’s called a chain.

Note: it’s super easy to create a self-signed certificate (i.e. without a trust chain):

openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem \
  -sha256 -days 3650 -nodes \
  -subj "/C=XX/ST=StateName/L=CityName/O=CompanyName/OU=CompanySectionName/CN=example.com"

Upon seeing this, the client will check if any certificate in the chain is part of its “root” certificate collection, that is, the pool of certificates it knows it can trust. As you can imagine, it’s pretty darn hard to get that endorsement if you don’t actually own the domain you pretend to be serving (example.com in our case).
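
If you’re curious, you can look at what a real server presents, and watch verification fail on our self-signed certificate, with a couple of openssl commands (a sketch; the exact output wording varies by openssl version):

# look at the certificate a real server presents: its subject, and the issuer vouching for it
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer

# our self-signed certificate from above, on the other hand, chains to nothing the system trusts
openssl verify cert.pem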

Long story short, the client is just going to scoff at you and disconnect immediately.

And that’s where VPNs come into play. A lot of VPNs offer protection from viruses and other threats, but to do this, they need to see what’s happening at the HTTP level. So what’s the solution? Very simply, they ask you to install a root certificate they created. This way, they can fake being myspace.com or geocities.com (I’m running out of example sites, I admit) by crafting their own certificates with a chain of trust that includes their own root certificate.
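
To make the mechanism concrete, here’s a hedged sketch (domain and file names are placeholders, and real deployments also need subjectAltName entries and careful key handling): once a client has installed ca.pem as a trusted root, it will accept the leaf certificate below, even though we don’t own example.com.

# 1. create our own root CA (key + self-signed CA certificate)
openssl req -x509 -newkey rsa:4096 -keyout ca-key.pem -out ca.pem \
  -sha256 -days 3650 -nodes -subj "/CN=Totally Legit Inspection CA"

# 2. create a key and a signing request for the site we want to impersonate
openssl req -newkey rsa:2048 -keyout leaf-key.pem -out leaf.csr \
  -nodes -subj "/CN=example.com"

# 3. sign it with our CA: only clients that trust ca.pem will accept it
openssl x509 -req -in leaf.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial \
  -out leaf.pem -days 825 -sha256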

So to benefit from the privacy of a VPN, you usually relinquish your privacy to them.

That’s also how some very controlling corporate networks or countries operate: the only way to access the internet is through servers they control, after having installed a root certificate they own, offering them a peek at your traffic.

TL;DR: if you don’t own the website, you won’t have a valid certificate, and clients will not trust you (unless they agreed to install a root certificate).

 

Caching EVERYTHING can be tricky

If you are able to compromise on the transparency, as we’ve seen above, you can finally get to the fun part of the equation: caching!

Did I say “fun”? Maybe I meant “challenging”... Caching data you don’t own is hard to get right. HTTP is 30-35 years old, so we are dealing with a lot of bad practices, legacy code and “good enough” behaviors, and you can’t trust every single website to be clear about what is cacheable and what isn’t (a quick way to check what an origin claims is shown after the list). As a result, if you are not careful enough, you can:

  • undercache: not caching, or not caching long enough, content that could have been saved, wasting resources
  • overcache: the opposite, keeping old or uncacheable content in cache, resulting in stale objects being delivered to users
  • cause cache collisions: the worst of them all, where different objects are stored under the same hash, causing all sorts of chaos, like delivering confidential pages to the wrong user (:shudders:).
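
A first line of defense is simply looking at what the origin claims about its own content. Here’s a hedged sketch with curl (example.com stands in for whatever origin you’re fronting, and some origins answer HEAD requests differently from GET):

# -I sends a HEAD request; the caching-related headers tell you what the origin *thinks* is cacheable
curl -sI https://example.com/ | grep -iE '^(cache-control|expires|vary|age|etag|set-cookie):'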

If you think about it, this is also why VCL was born in Varnish: because we could not rely on a set of well-defined guidelines and configuration options, we needed a tool that gives absolute control over caching, so users can tweak, fiddle with and fix data and behaviors at the caching level, rather than at origins they may not control.

TL;DR: you need to make sure you know what you are caching, and that you have the necessary tools to define complex caching behaviors.

It's a wrap

And now you know! It was a bit different from my other posts, so I definitely hope you enjoyed it. Caching doesn’t live in a vacuum and it’s important to understand the context around it, even if it can be a bit overwhelming at times.

Want to learn more? Download Varnish 6 By Example, a practical book full of tips and best practices for getting the most out of your Varnish setup and reaching new heights in your caching operations, whether you’re new to Varnish or an experienced pro.
