PSA: You don't need that many regexes

HTTP is an intrinsically textual protocol, with relatively few rules. So it makes sense for Varnish to provide you with one of the best tools available to manipulate text: the regular expression, or "regex" (or "regexp", or "regexp?" if you want to be very clever).

But even though the general rules are few, HTTP has lived for so long that a number of its headers grew semantics of their own, making a general-purpose tool like regexes more cumbersome and less efficient to use.

Fortunately, we have VMODs to handle those! And that's what we'll be covering in this piece, with examples and stuff!

Let's set up the playground

An example is worth a thousand words, if I remember correctly, so I'll use that to help you picture the problem.

We'll work from this fake yet realistic (well, if you exclude all those comments, of course) VCL:

sub vcl_recv {
	# sanitize A-L header, preferring french, then english, then danish
	if (req.http.accept-language ~ "fr") {
		set req.http.accept-language = "fr";
	} else if (req.http.accept-language ~ "en") {
		set req.http.accept-language = "en";
	} else {
		set req.http.accept-language = "da";
	}

	# uncacheable CMS paths
	if (req.url ~ "^/admin/" ||
	    req.url ~ "^/wp/admin") {
		return(pass);
	}
	# uncacheable e-commerce path
	if (req.url ~ "^/checkout/") {
		return(pass);
	}

	# remove "foo" and all "utm_something" query string parameters
	set req.url = regsuball(req.url, "(foo|utm_[a-z]+)=[-_A-z0-9+()%.]+&?", "");
	set req.url = regsub(req.url, "[?|&]+$", "");
	set req.url = std.querysort(req.url);

	# strip cookie totally if content is static, otherwise keep only
	# the COOKIE1 and COOKIE2 cookies
	if (req.url ~ "\.(jpg|png|css|html?|js)") {
		unset req.http.cookie;
	} else {
		# remove all cookies but COOKIE1 and COOKIE2
		set req.http.Cookie = ";" + req.http.Cookie;
		set req.http.Cookie = regsuball(req.http.Cookie, "; +", ";");
		set req.http.Cookie = regsuball(req.http.Cookie, ";(COOKIE1|COOKIE2)=", "; \1=");
		set req.http.Cookie = regsuball(req.http.Cookie, ";[^ ][^;]*", "");
		set req.http.Cookie = regsuball(req.http.Cookie, "^[; ]+|[; ]+$", "");

		if (req.http.Cookie == "") {
			unset req.http.Cookie;
		}
	}
	return (hash);
}

sub vcl_backend_response {
	# same thing as for vcl_recv, the backend may send a set-cookie header
	# that we need to clean
	if (bereq.url ~ "\.(jpg|png|css|html?|js)") {
		unset beresp.http.set-cookie;
	} else {
		set beresp.http.set-cookie = regsuball(beresp.http.set-cookie, "bar=[^;]+(; )?", "");
		set beresp.http.set-cookie = regsuball(beresp.http.set-cookie, "__utm.=[^;]+(; )?", "");
		# if we have an empty header, remove it
		if (beresp.http.set-cookie == "") {
			unset beresp.http.set-cookie;
		}
	}
	return (deliver);
}

sub vcl_hash {
	hash_data(req.http.cookie);
}

Acceptance in the first step

Let's focus on the first block of code:

	if (req.http.accept-language ~ "fr") {
		set req.http.accept-language = "fr";
	} else if (req.http.accept-language ~ "en") {
		set req.http.accept-language = "en";
	} else {
		set req.http.accept-language = "da";
	}

Obviously, the goal is to normalize the "Accept-Language" header, which is a good idea: if the server is capable of serving multiple languages, it should respond with a "vary: accept-language" header, meaning that Varnish will store a separate variant of the object for each distinct "Accept-Language" value it sees. It's pretty important then that this header is normalized to avoid duplicating cache objects.

Unfortunately, it doesn't really work; imagine a request comes in with this header:

Accept-Language: da;q=1, fr;q=0.1, en;q=0.9

The user prefers Danish, yet we are going to serve them French, their least favorite option, simply because "fr" is what we checked for first and found. And sadly, regular expressions aren't equipped for this sort of logic...

Happily, vmod_accept is! We can rewrite the code like this:

sub vcl_init {
	new lang = accept.rule("da");
	lang.add("da");
	lang.add("fr");
	lang.add("en");
}

sub vcl_recv {
	set req.http.accept-language = lang.filter(req.http.accept-language);
}

This will respect the preference of the client, while still normalizing the header and presenting only one language to the server. Mission accomplished!
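If you want to see the normalization at work on a test setup, one option is to copy the original header aside and expose both values on the response. This is only a sketch: the "x-al-original" and "x-al-normalized" header names are made up for the occasion, and the snippet redefines the same lang object as above so that it can be read on its own.

```vcl
import accept;

sub vcl_init {
	# same rule as above: danish is the fallback
	new lang = accept.rule("da");
	lang.add("da");
	lang.add("fr");
	lang.add("en");
}

sub vcl_recv {
	# hypothetical debug header: remember what the client really sent
	set req.http.x-al-original = req.http.accept-language;
	# normalize as before
	set req.http.accept-language = lang.filter(req.http.accept-language);
}

sub vcl_deliver {
	# debug only: show the client's header next to the normalized one
	set resp.http.x-al-original = req.http.x-al-original;
	set resp.http.x-al-normalized = req.http.accept-language;
}
```

With the "da;q=1, fr;q=0.1, en;q=0.9" request from earlier, and if the VMOD behaves as described, the normalized header should come back as "da". You'd want to drop these debug headers before going to production.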

Use vmod_rewrite to... not rewrite?

When vmod_rewrite was created, the main goal was to have a tool to handle URL rewriting (for redirection) and host sanitizing: look for a match in a list and, if one is found, rewrite the string. But as always, we try to write generic tools, and we ended up splitting the matching and rewriting features.

Surprise! The former is going to be useful to replace this code:

	# uncacheable CMS paths
	if (req.url ~ "^/admin/" ||
	    req.url ~ "^/wp/admin") {
		return(pass);
	}
	# uncacheable e-commerce path
	if (req.url ~ "^/checkout/") {
		return(pass);
	}

In every platform there are paths that aren't going to be cacheable, so we can just check the beginning of the URL and pass (i.e. not cache) if that's one of those uncacheable cases. This is what is done in the code above, but you can see that we are systematically using an anchor "^" to express "the beginning of the string". With vmod_rewrite, we don't need it:

sub vcl_init {
	new pass_paths = rewrite.ruleset(string = {"
	    # uncacheable CMS paths
	    "/admin/"
	    "/wp/admin/"
	    # uncacheable e-commerce path
	    "/checkout/"
	"}, min_fields = 1, type = prefix);
}

sub vcl_recv {
	if (pass_paths.match(req.url)) {
		return(pass);
	}
}

We just tucked all the paths in an object, without losing our comments - that's pretty neat. We do add a level of indirection, but that clarifies the code, and proper naming helps to understand why we are passing, even if we don't have the actual list in vcl_recv.

Filtering the query string

Next up is query string sanitation, which is pretty important if you want to avoid cache explosion at the hands of script kiddies (how's that for a dramatic intro?). The issue is that the query string is part of the URL, and Varnish uses it as part of the hash key, so all these URLs will end up being cached independently:

  • /foo?a=1&b=2&c=3
  • /foo?c=3&b=2&a=1
  • /foo?a=1&b=2&c=3&d=4

The first and second URLs should definitely be the same object, and if the page doesn't depend on the "d" parameter, then we are looking at one object stored three times.

vmod_std has the .querysort() function that's already going to provide some benefits, but if you want to filter, chances are that you'll resort to something like this:

# remove "foo" and all "utm_something" query string parameters
set req.url = regsuball(req.url, "(foo|utm_[a-z]+)=[-_A-z0-9+()%.]+&?", "");
set req.url = regsub(req.url, "[?|&]+$", "");
set req.url = std.querysort(req.url);

If it makes your eyes bleed, congratulations, you are human! If it doesn't bother you because "hey, it gets the job done", congratulations, you're a sysadmin! Even then, you should be able to appreciate the tidiness brought by the urlplus vmod:

urlplus.query_delete("foo");
urlplus.query_delete_regex("utm_[a-z]+");
urlplus.write();

Note that we are still using a regex here, but only to describe the names of the query string parameters we care about, not to work around semantics.
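One detail: the regex version also sorted the parameters with std.querysort(). I'm not stating here whether urlplus.write() reorders anything on its own, so if you want to keep that step explicit, a belt-and-braces sketch could look like this (it only uses the calls shown above, plus vmod_std):

```vcl
import std;
import urlplus;

sub vcl_recv {
	# drop the parameters that don't influence the response
	urlplus.query_delete("foo");
	urlplus.query_delete_regex("utm_[a-z]+");
	# write the cleaned URL back to req.url
	urlplus.write();

	# keep the explicit sort from the regex version, so that
	# /foo?c=3&b=2&a=1 and /foo?a=1&b=2&c=3 share one cache object
	set req.url = std.querysort(req.url);
}
```

The sort runs last on purpose: it then works on the URL that will actually be hashed.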

Not a tough cookie

Here, we'll break linearity a bit and jump to the cookie handling bits:

# remove all cookies but COOKIE1 and COOKIE2
set req.http.Cookie = ";" + req.http.Cookie;
set req.http.Cookie = regsuball(req.http.Cookie, "; +", ";");
set req.http.Cookie = regsuball(req.http.Cookie, ";(COOKIE1|COOKIE2)=", "; \1=");
set req.http.Cookie = regsuball(req.http.Cookie, ";[^ ][^;]*", "");
set req.http.Cookie = regsuball(req.http.Cookie, "^[; ]+|[; ]+$", "");

if (req.http.Cookie == "") {
    unset req.http.Cookie;
}

This is properly atrocious, but thankfully, same as for the query string: "there's a VMOD for that" (I should patent that, I'd make millions!), and its name is vmod_cookieplus. Here's what we can replace the code with:

cookieplus.keep("COOKIE1");
cookieplus.keep("COOKIE2");
cookieplus.write();

Actually, I'm almost a bit sad about how anticlimactic this is: there's nothing to explain, no trick to take care of, it just... works.

It's a nice file; it's just not my type

By default, Varnish won't cache requests with cookies, because it has no context information, and it doesn't want to risk caching your bank statement and serving it to everyone (it's apparently frowned upon).
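For the requests that do keep COOKIE1 and COOKIE2, the playground VCL got around that default by returning hash itself and adding the cookie header to the hash key. Here's a sketch of that pairing with the cookieplus calls; it assumes nothing else in your vcl_recv needs to run after it, and since returning early also skips the rest of the built-in vcl_recv, the method guard is re-added by hand:

```vcl
import cookieplus;

sub vcl_recv {
	# returning from vcl_recv skips the built-in logic, so keep its
	# "only GET and HEAD are looked up in cache" guard by hand
	if (req.method != "GET" && req.method != "HEAD") {
		return (pass);
	}

	# keep only the two cookies the backend cares about
	cookieplus.keep("COOKIE1");
	cookieplus.keep("COOKIE2");
	cookieplus.write();

	# look the object up even though a cookie header may remain
	return (hash);
}

sub vcl_hash {
	# one cache object per remaining cookie value; there is no return
	# here, so the built-in vcl_hash still adds the URL and the host
	hash_data(req.http.cookie);
}
```

Only do this if the responses really depend on nothing but those two cookies, otherwise we are back to the bank statement problem.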

One of the first things you do to raise your hit-ratio is to force the caching of static files, identified by their extension, by nuking the cookie (from users) and set-cookie (from backends) headers, like so:

sub vcl_recv {
	if (req.url ~ "\.(jpg|png|css|html?|js)") {
		unset req.http.cookie;
	}
}

sub vcl_backend_response {
	if (bereq.url ~ "\.(jpg|png|css|html?|js)") {
		unset beresp.http.set-cookie;
	}
}

It sort of works, except for some corner cases like "/uncacheable.php?img=avatar.png" (wrongly matched) or "/cacheable.pnG" (wrongly missed), but by trying hard enough, a regex can be written. The real issue, besides readability, is the redundancy, and not of the good kind. Chances are that when someone tries to update the list of static extensions, they'll only touch one of the blocks, leading to subpar performance.
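For the record, here's what "trying hard enough" could look like. This is my own sketch of a stricter regex, not something lifted from a production setup: it ignores case, only looks at the path (everything before the first "?"), and requires the extension to be the last thing in that path.

```vcl
sub vcl_recv {
	# (?i)        case-insensitive, so "/cacheable.pnG" matches
	# ^[^?]*      stay in the path, never look inside the query string
	# (\?.*)?$    the extension must end the path; a query string may follow
	if (req.url ~ "(?i)^[^?]*\.(jpg|png|css|html?|js)(\?.*)?$") {
		unset req.http.cookie;
	}
}
```

It no longer matches "/uncacheable.php?img=avatar.png", but it is even harder on the eyes, and it would still have to be duplicated in vcl_backend_response.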

For this task, we'll use three VMODs we already know:

  • urlplus to extract the extension
  • std to convert said extension to lowercase (easier matching)
  • rewrite to match that string against known cacheable extensions

This leads to this VCL:

sub vcl_init {
	new static = rewrite.ruleset(string = {"
	    "jpg"
	    "png"
	    "css"
	    "htm"
	    "html"
	    "js"
	"}, min_fields = 1, type = exact);
}

sub vcl_recv {
	if (static.match(std.tolower(urlplus.get_extension()))) {
		unset req.http.cookie;
	}
}

sub vcl_backend_response {
	if (static.match(std.tolower(urlplus.get_extension()))) {
		unset beresp.http.set-cookie;
	}
}

Again, we add a layer of indirection, but that list only exists once.

The full picture

All the pieces are in place, and here's the full VCL:

import accept;
import cookieplus;
import rewrite;
import std;
import urlplus;

sub vcl_init {
	# accepts english, french, and the default, danish
	new lang = accept.rule("da");
	lang.add("da");
	lang.add("fr");
	lang.add("en");

	# uncacheable paths
	new pass_paths = rewrite.ruleset(string = {"
	    # uncacheable CMS paths
	    "/admin/"
	    "/wp/admin/"
	    # uncacheable e-commerce path
	    "/checkout/"
	"}, min_fields = 1, type = prefix);

	# static content
	new static = rewrite.ruleset(string = {"
	    "jpg"
	    "png"
	    "css"
	    "htm"
	    "html"
	    "js"
	"}, min_fields = 1, type = exact);
}

sub vcl_recv {
	# clean the accept-language header
	set req.http.accept-language = lang.filter(req.http.accept-language);

	# bypass the cache if in one of the uncacheable subtrees
	if (pass_paths.match(req.url)) {
		return(pass);
	}

	# delete "foo" and all "utm_something" query string parameters
	urlplus.query_delete("foo");
	urlplus.query_delete_regex("utm_[a-z]+");
	urlplus.write();

	# no cookie for static content, and only COOKIE1 and COOKIE2
	# for the rest
	if (static.match(std.tolower(urlplus.get_extension()))) {
		unset req.http.cookie;
	} else {
		cookieplus.keep("COOKIE1");
		cookieplus.keep("COOKIE2");
		cookieplus.write();
	}
}

sub vcl_backend_response {
	if (pass_paths.match(bereq.url)) {
		return(deliver);
	} else if (static.match(std.tolower(urlplus.get_extension()))) {
		unset beresp.http.set-cookie;
	} else {
		cookieplus.setcookie_keep("COOKIE1");
		cookieplus.setcookie_keep("COOKIE2");
		cookieplus.setcookie_write();
	}
}

It's not shorter than the old one (it's even a tiny bit longer); however, it's way cleaner and more maintainable, which is what we should aim for when writing code for a critical component, such as Varnish.

Also, note that all the string lists we are using inline to create vmod_rewrite objects could be placed inside independent files. That's a great option if they start growing too much, or if you want to modify them without touching the VCL.

The point I wanted to make in this post is obviously that regexes are not necessarily the best tool for all jobs ("when all you have is regexes, everything looks like a string"), but I think we've seen a collateral benefit too: VCL, by its programming nature, encourages us to compose basic tools together to achieve our goals. As proof, see how we used urlplus to decide whether or not we should trigger some cookieplus logic.

That's all for today, but if you see something that isn't crystal clear, hit me up in the comments, or on Twitter; this has been a beefy walkthrough and I had to streamline a few things, but I'd be happy to dig deeper with you.

 

Photo (c) 2012 deux-chi used under Creative Commons license.

 

  

22/08/18 13:30 by Guillaume Quintard
