HTTP is an intrinsically textual protocol, with relatively few rules. So it makes sense for Varnish to provide you with one of the best tools available to manipulate text: the regular expression, or "regex" (or "regexp", or "regexp?" if you want to be very clever).
But even though the general rules are few, and because HTTP has lived for so long, a number of its headers grew semantics of their own, making a general purpose tool like regexes more cumbersome and less efficient to use.
Fortunately, we have VMODs to handle those! And that's what we'll be covering in this piece, with examples and stuff!
Let's set up the playground
An example is worth a thousand words, if I remember correctly, so I'll use that to help you picture the problem.
We'll work from this fake yet realistic (well, if you exclude all those comments, of course) VCL:
sub vcl_recv {
# sanitize A-L header, preferring french, then english, then danish
if (req.http.accept-language ~ "fr") {
set req.http.accept-language = "fr";
} else if (req.http.accept-language ~ "en") {
set req.http.accept-language = "en";
} else {
set req.http.accept-language = "da";
}
# uncacheable CMS paths
if (req.url ~ "^/admin/" ||
req.url ~ "^/wp/admin") {
return(pass);
}
# uncacheable e-commerce path
if (req.url ~ "^/checkout/") {
return(pass);
}
# remove "foo" and all "utm_something" query string parameters
set req.url = regsuball(req.url, "(foo|utm_[a-z]+)=[-_A-z0-9+()%.]+&?", "");
set req.url = regsub(req.url, "[?|&]+$", "");
set req.url = std.querysort(req.url);
# strip cookies entirely if content is static, otherwise keep only
# COOKIE1 and COOKIE2
if (req.url ~ "\.(jpg|png|css|html?|js)") {
unset req.http.cookie;
} else {
# remove all cookies but COOKIE1 and COOKIE2
set req.http.Cookie = ";" + req.http.Cookie;
set req.http.Cookie = regsuball(req.http.Cookie, "; +", ";");
set req.http.Cookie = regsuball(req.http.Cookie, ";(COOKIE1|COOKIE2)=", "; \1=");
set req.http.Cookie = regsuball(req.http.Cookie, ";[^ ][^;]*", "");
set req.http.Cookie = regsuball(req.http.Cookie, "^[; ]+|[; ]+$", "");
if (req.http.Cookie == "") {
unset req.http.Cookie;
}
}
return (hash);
}
sub vcl_backend_response {
# same thing as for vcl_recv, the backend may send a set-cookie header
# that we need to clean
if (bereq.url ~ "\.(jpg|png|css|html?|js)") {
unset beresp.http.set-cookie;
} else {
set beresp.http.set-cookie = regsuball(beresp.http.set-cookie, "bar=[^;]+(; )?", "");
set beresp.http.set-cookie = regsuball(beresp.http.set-cookie, "__utm.=[^;]+(; )?", "");
# if we have an empty header, remove it
if (beresp.http.set-cookie == "") {
unset beresp.http.set-cookie;
}
}
return (deliver);
}
sub vcl_hash {
hash_data(req.http.cookie);
}
Acceptance in the first step
Let's focus on the first block of code:
if (req.http.accept-language ~ "fr") {
set req.http.accept-language = "fr";
} else if (req.http.accept-language ~ "en") {
set req.http.accept-language = "en";
} else {
set req.http.accept-language = "da";
}
Obviously, the goal is to normalize the "Accept-Language" header, which is a good idea: if the server is capable of serving multiple languages, it should respond with a "vary: accept-language" header, meaning that "Accept-Language" should be added to the hash key. It's pretty important then that this header is normalized to avoid duplicating cache objects.
Unfortunately, it doesn't really work; imagine a request comes in with this header:
Accept-Language: da;q=1, fr;q=0.1, en;q=0.9
Even though the user would prefer Danish, we are going to serve them French because that is what we checked for first and found, even though that's the least favorite option for the user. And sadly, regular expressions aren't equipped for this sort of logic...
Happily, vmod_accept is! We can rewrite the code like this:
sub vcl_init {
new lang = accept.rule("da");
lang.add("da");
lang.add("fr");
lang.add("en");
}
sub vcl_recv {
set req.http.accept-language = lang.filter(req.http.accept-language);
}
This will respect the preference of the client, while still normalizing the header and presenting only one language to the server. Mission accomplished!
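If you want to watch the normalization happen, here is a minimal sketch that logs the header before and after filtering. It reuses the lang object from above and std.log() from vmod_std; the x-orig-accept-language header is a made-up name used only to hold the original value for the log line, and the fragment is meant to be merged into an existing VCL rather than loaded on its own.

```vcl
import accept;
import std;

sub vcl_init {
    # same rule as above: Danish is the fallback
    new lang = accept.rule("da");
    lang.add("da");
    lang.add("fr");
    lang.add("en");
}

sub vcl_recv {
    # x-orig-accept-language is a hypothetical header name, used only
    # to keep the client's original value around for the log line
    set req.http.x-orig-accept-language = req.http.accept-language;
    set req.http.accept-language = lang.filter(req.http.accept-language);

    # std.log() emits a VCL_Log record in the shared memory log
    std.log("accept-language: " + req.http.x-orig-accept-language +
        " -> " + req.http.accept-language);

    # don't send the debugging header to the backend
    unset req.http.x-orig-accept-language;
}
```

Running something like `varnishlog -g request -i VCL_Log` while sending the example header from earlier should then show which language was picked.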
Use vmod_rewrite to... not rewrite?
When vmod_rewrite was created, the main goal was a tool for URL rewriting (for redirection) and host sanitizing: look for a match in a list and, if one is found, rewrite the string. But as always, we try to write generic tools, so we ended up splitting the matching and rewriting features.
Surprise! The former is going to be useful to replace this code:
# uncacheable CMS paths
if (req.url ~ "^/admin/" ||
req.url ~ "^/wp/admin") {
return(pass);
}
# uncacheable e-commerce path
if (req.url ~ "^/checkout/") {
return(pass);
}
In every platform there are paths that aren't going to be cacheable, so we can just check the beginning of the URL and pass (i.e. not cache) if that's one of those uncacheable cases. This is what is done in the code above, but you can see that we are systematically using an anchor "^" to express "the beginning of the string". With vmod_rewrite, we don't need it:
sub vcl_init {
new pass_paths = rewrite.ruleset(string = {"
# uncacheable CMS paths
"/admin/"
"/wp/admin/"
# uncacheable e-commerce path
"/checkout/"
"}, min_fields = 1, type = prefix);
}
sub vcl_recv {
if (pass_paths.match(req.url)) {
return(pass);
}
}
We just tucked all the paths in an object, without losing our comments - that's pretty neat. We do add a level of indirection, but that clarifies the code, and proper naming helps to understand why we are passing, even if we don't have the actual list in vcl_recv.
Filtering the query string
Next up is query string sanitation, which is pretty important if you want to avoid cache explosion at the hand of script kiddies (how's that for a dramatic intro?). The issue is that the query string is part of the URL, and Varnish uses it as part of the hash key, so all these URLs will end up being cached independently:
- /foo?a=1&b=2&c=3
- /foo?c=3&b=2&a=1
- /foo?a=1&b=2&c=3&d=4
The first and second URLs should definitely be the same object, and if the page doesn't depend on the "d" parameter, then we are looking at the same object duplicated three times.
vmod_std has the .querysort() function that's already going to provide some benefits, but if you want to filter, chances are that you'll resort to something like this:
# remove "foo" and all "utm_something" query string parameters
set req.url = regsuball(req.url, "(foo|utm_[a-z]+)=[-_A-z0-9+()%.]+&?", "");
set req.url = regsub(req.url, "[?|&]+$", "");
set req.url = std.querysort(req.url);
If it makes your eyes bleed, congratulations, you are human! If it doesn't bother you because "hey, it gets the job done", congratulations, you're a sysadmin! Even then, you should be able to appreciate the tidiness brought by the urlplus vmod:
urlplus.query_delete("foo");
urlplus.query_delete_regex("utm_[a-z]+");
urlplus.write();
Note that we are still using a regex here, but only to describe the names of the query string parameters we care about, not to work around semantics.
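The regex version ended with a call to std.querysort(), which the three urlplus lines don't show. Here is a rough sketch of how the two could be combined inside vcl_recv, assuming you still want the parameters sorted explicitly. It deliberately makes no assumption about whether urlplus sorts anything by itself, so check the module's documentation before keeping the extra line.

```vcl
import std;
import urlplus;

sub vcl_recv {
    # drop "foo" and every "utm_something" parameter
    urlplus.query_delete("foo");
    urlplus.query_delete_regex("utm_[a-z]+");
    # apply the changes to req.url
    urlplus.write();

    # explicit sort, carried over from the regex version, so that
    # /foo?a=1&b=2&c=3 and /foo?c=3&b=2&a=1 share one cache object
    set req.url = std.querysort(req.url);
}
```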
Not a tough cookie
Here, we'll break linearity a bit and jump to the cookie handling bits:
# remove all cookies but COOKIE1 and COOKIE2
set req.http.Cookie = ";" + req.http.Cookie;
set req.http.Cookie = regsuball(req.http.Cookie, "; +", ";");
set req.http.Cookie = regsuball(req.http.Cookie, ";(COOKIE1|COOKIE2)=", "; \1=");
set req.http.Cookie = regsuball(req.http.Cookie, ";[^ ][^;]*", "");
set req.http.Cookie = regsuball(req.http.Cookie, "^[; ]+|[; ]+$", "");
if (req.http.Cookie == "") {
unset req.http.Cookie;
}
This is properly atrocious, but thankfully, same as for the query string: "there's a VMOD for that" (I should patent that, I'd make millions!), and its name is vmod_cookieplus. Here's what we can replace the code with:
cookieplus.keep("COOKIE1");
cookieplus.keep("COOKIE2");
cookieplus.write();
Actually, I'm almost a bit sad about how anticlimactic this is: there's nothing to explain, no trick to take care of, it just... works.
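If the list of cookies to keep grows, a pattern may be easier to maintain than one call per name. The sketch below assumes that cookieplus has a keep_regex() counterpart to keep(), in the same way urlplus pairs query_delete() with query_delete_regex(); that function does not appear in this post, so treat it as an assumption and verify it against the module's documentation.

```vcl
import cookieplus;

sub vcl_recv {
    # ASSUMPTION: keep_regex() is not shown in this article; it is
    # expected to keep every cookie whose name matches the pattern
    cookieplus.keep_regex("^COOKIE[12]$");
    # apply the changes to the Cookie header
    cookieplus.write();
}
```

The anchors (^ and $) are there so that a cookie named, say, XCOOKIE1 doesn't slip through by accident.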
It's a nice file; it's just not my type
By default, Varnish won't cache requests with cookies, because it has no context information, and it doesn't want to risk caching your bank statement and serving it to everyone (it's apparently frowned upon).
One of the first things you do to raise your hit-ratio is to force the caching of static files, identified by their extension, by nuking the cookie (from users) and set-cookie (from backends) headers, like so:
sub vcl_recv {
if (req.url ~ "\.(jpg|png|css|html?|js)") {
unset req.http.cookie;
}
}
sub vcl_backend_response {
if (bereq.url ~ "\.(jpg|png|css|html?|js)") {
unset beresp.http.set-cookie;
}
}
It sort of works well, except for some corner cases like "/uncacheable.php?img=avatar.png" or "/cacheable.pnG", but by trying hard enough, a regex can be written. The real issue, besides readability, is the redundancy, and not of the good kind. Chances are that when someone tries to update the list of static extensions, they'll only touch one of the blocks, leading to subpar performance.
For this task, we'll use three VMODs we already know:
- urlplus to extract the extension
- std to convert said extension to lowercase (easier matching)
- rewrite to match that string against known cacheable extensions
This leads to this VCL:
sub vcl_init {
new static = rewrite.ruleset(string = {"
"jpg"
"png"
"css"
"htm"
"html"
"js"
"}, min_fields = 1, type = exact);
}
sub vcl_recv {
if (static.match(std.tolower(urlplus.get_extension()))) {
unset req.http.cookie;
}
}
sub vcl_backend_response {
if (static.match(std.tolower(urlplus.get_extension()))) {
unset beresp.http.set-cookie;
}
}
Again, we add a layer of indirection, but that list only exists once.
The full picture
All the pieces are in place, and here's the full VCL:
import accept;
import cookieplus;
import rewrite;
import std;
import urlplus;
sub vcl_init {
# accepts english, french, and the default, danish
new lang = accept.rule("da");
lang.add("da");
lang.add("fr");
lang.add("en");
# uncacheable paths
new pass_paths = rewrite.ruleset(string = {"
# uncacheable CMS paths
"/admin/"
"/wp/admin/"
# uncacheable e-commerce path
"/checkout/"
"}, min_fields = 1, type = prefix);
# static content
new static = rewrite.ruleset(string = {"
"jpg"
"png"
"css"
"htm"
"html"
"js"
"}, min_fields = 1, type = exact);
}
sub vcl_recv {
# clean the accept-language header
set req.http.accept-language = lang.filter(req.http.accept-language);
# bypass the cache if in one of the uncacheable subtrees
if (pass_paths.match(req.url)) {
return(pass);
}
# delete "foo" and all "utm_something" query string parameters
urlplus.query_delete("foo");
urlplus.query_delete_regex("utm_[a-z]+");
urlplus.write();
# no cookie for static content, and only COOKIE1 and COOKIE2
# for the rest
if (static.match(std.tolower(urlplus.get_extension()))) {
unset req.http.cookie;
} else {
cookieplus.keep("COOKIE1");
cookieplus.keep("COOKIE2");
cookieplus.write();
}
}
sub vcl_backend_response {
if (pass_paths.match(bereq.url)) {
return(deliver);
} else if (static.match(std.tolower(urlplus.get_extension()))) {
unset beresp.http.set-cookie;
} else {
cookieplus.setcookie_keep("COOKIE1");
cookieplus.setcookie_keep("COOKIE2");
cookieplus.setcookie_write();
}
}
It's not shorter than the old one (it's even a tiny bit longer); however, it's way cleaner and more maintainable, which is what we should aim for when writing code for a critical component, such as Varnish.
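One thing to keep in mind: the playground VCL ended vcl_recv with return (hash) and added the cookie header to the hash key, two pieces the rewritten version leaves out. If you still want that behavior, a minimal sketch is to append the fragment below after the full VCL. It relies on the fact that Varnish concatenates subroutines sharing the same name, in the order they appear.

```vcl
# appended after the full VCL above; this vcl_recv body runs after
# the earlier one, unless that one has already returned (e.g. pass)
sub vcl_recv {
    # carried over from the original playground VCL
    return (hash);
}

sub vcl_hash {
    # at this point the cookie header holds at most COOKIE1 and COOKIE2
    hash_data(req.http.cookie);
}
```

Whether hashing on the remaining cookies is a good idea depends on what COOKIE1 and COOKIE2 actually contain, so consider this a starting point rather than a recommendation.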
Also, note that all the string lists we are using inline to create vmod_rewrite objects could be placed inside independent files. That's a great option if they start growing too much, or if you want to modify them without touching the VCL.
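As a rough sketch of the file-based variant, assuming the ruleset constructor takes the file path as its first positional argument (only the string argument is shown in this post) and that the file uses the same syntax as the inline string:

```vcl
import rewrite;

sub vcl_init {
    # /etc/varnish/pass_paths.rules is a hypothetical path; the file is
    # assumed to contain the same lines as the inline version:
    #   # uncacheable CMS paths
    #   "/admin/"
    #   "/wp/admin/"
    #   # uncacheable e-commerce path
    #   "/checkout/"
    new pass_paths = rewrite.ruleset("/etc/varnish/pass_paths.rules",
        min_fields = 1, type = prefix);
}
```

The rest of the VCL stays untouched, since vcl_recv only ever talks to the pass_paths object.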
The point I wanted to make in this post is obviously that regexes are not necessarily the best tool for all jobs ("when all you have is regexes, everything looks like a string"), but I think we've seen a collateral too: VCL, by its programming nature, encourages us to compose basic tools together to achieve our goals. As proof, see how we used urlplus to decide whether or not we should trigger some cookieplus logic.
That's all for today, but if you see something that isn't crystal clear, hit me up in the comments, or on Twitter; this has been a beefy walkthrough and I had to streamline a few things, but I'd be happy to dig deeper with you.
Photo (c) 2012 deux-chi used under Creative Commons license.