Put your URLs up, and keep'em where I can see'em: rewriting URLs with Varnish

Note: this article is the first in a two-part series; you can find the second post here.

Cool URIs don't change, it is said. And according to itself, that URL must be pretty cool, because it hasn't changed in 18 years! The gits[1] of it is: "the net is vast and infinite" and people will reference your site, so you should be a good citizen and not shuffle resource URLs around. Otherwise we get dangling links, and dangling links are bad, m'kay?

Not changing URLs is the best thing to do, and Varnish can help you with that, but there are cases where it's not applicable, and Varnish can help there too! Let's dive into the world of URL rewriting and HTTP redirection. This post will cover URL rewriting, and the next post in the series will focus on 301, 302 and all sorts of things about HTTP redirection.

Rewriting URLs

Stop me if you've heard this one before: you were using CMS-A but realized (or management realized it for you) that it just wasn't good enough, so you are now migrating to CMS-B. Both platforms are pretty strict about locations and they are of course incompatible! One wants images to be reached from /image/, the other from /images/, and articles that were in /cmsa/post/ should now be accessible from /content/articles, and so on.

The migration went mostly OK (you used a custom script), but it clearly missed a few spots, and the internet is full of links pointing to the old URLs instead of the new ones.

VTC-Driven Development

Instead of brutally changing the file locations, we can use Varnish to rewrite URLs from the old locations to the new ones.

Let's take a hint from programmers here, and apply some TDD (Test-Driven Development) philosophy. First, let's specify what we want. In our case, we'd like Varnish to rewrite:

  • URLs looking like /cmsa/post/* into /content/articles/*
  • URLs looking like /images/* into /image/*

This sounds easy enough, so let's write the test, foo.vtc:

varnishtest "Testing URL rewriting"

server s1 {
	rxreq
	txresp
	expect req.url == "/content/articles/my_post.html"

	rxreq
	txresp
	expect req.url == "/image/cutter_otter.jpg"
} -start

varnish v1 -vcl+backend {
    # VCL logic goes here
} -start

client c1 {
	txreq -url "/cmsa/post/my_post.html"
	rxresp

	txreq -url "/images/cutter_otter.jpg"
	rxresp
} -run

This spawns a server, a Varnish, then runs a client, with a few notable points:

  • v1 and s1 are started and run in the background (-start), c1 is started and we then wait for it to return (-run)
  • s1 and c1 are "fake" HTTP server and client, running a minimal HTTP stack, while Varnish is a real instance
  • -vcl+backend automatically creates a vcl with "vcl 4.0;" and backends (here, s1) prepended to it.
  • c1 connects to the first Varnish instance available (here, v1).
  • in s1, expect is done after the resp to make varnishtest fail faster. It's counterintuitive, I know, but trust me on this one.

Let's run this!

gquintard@home:master:~/work/varnish-cache$ varnishtest foo.vtc
...
**   s1    0.4 === expect req.url == "/content/articles/my_post.html"
---- s1    0.4 EXPECT req.url (/cmsa/post/my_post.html) == "/content/articles/my_post.html" failed
...

Unsurprisingly, that didn't go too well, but the error message (look for the line starting with '----') is helpful: s1 didn't receive what it expected (req.url was resolved as "/cmsa/post/my_post.html"), which is normal because we gave no instruction to Varnish. Time to change that!

~, regsub, regsuball

Copied from somewhere on the intarwebz, and adapted for our needs, this should do the trick, right?

varnishtest "Testing URL rewriting, for real"

server s1 {
	rxreq
	txresp
	expect req.url == "/content/articles/my_post.html"

	rxreq
	txresp
	expect req.url == "/image/cutter_otter.jpg"
} -start

varnish v1 -vcl+backend {
	sub vcl_recv {
		if (req.url ~ "/cmsa/post/") {
			set req.url = regsuball(req.url, "/cmsa/post/", "/content/articles/");
		} else if (req.url ~ "/images/") {
			set req.url = regsuball(req.url, "/images/", "/image/");
		}
	}

} -start

client c1 {
	txreq -url "/cmsa/post/my_post.html"
	rxresp

	txreq -url "/images/cutter_otter.jpg"
	rxresp
} -run

What does varnishtest say about it?

gquintard@home:master:~/work/varnish-cache$ varnishtest foo.vtc 
#     top  TEST foo.vtc passed (1.607)

Cool, it works! Job done then! Or is it? Let's try to add a few tests:

varnishtest "Testing URL rewriting"

server s1 {
	rxreq
	txresp
	expect req.url == "/content/articles/my_post.html"

	rxreq
	txresp
	expect req.url == "/image/cutter_otter.jpg"

	rxreq
	txresp
	expect req.url == "/image/user-pics/images/avatar1234.jpg"

	rxreq
	txresp
	expect req.url == "/othersite/images/moon-landing.jpg"
} -start

varnish v1 -vcl+backend {
	sub vcl_recv {
		if (req.url ~ "/cmsa/post/") {
			set req.url = regsuball(req.url, "/cmsa/post/", "/content/articles/");
		} else if (req.url ~ "/images/") {
			set req.url = regsuball(req.url, "/images/", "/image/");
		}
	}
} -start

client c1 {
	txreq -url "/cmsa/post/my_post.html"
	rxresp

	txreq -url "/images/cutter_otter.jpg"
	rxresp

	txreq -url "/images/user-pics/images/avatar1234.jpg"
	rxresp

	txreq -url "/othersite/images/moon-landing.jpg"
	rxresp
} -run

And BOOM!

gquintard@home:master:~/work/varnish-cache$ varnishtest foo.vtc
...
**   s1    0.5 === expect req.url == "/image/user-pics/images/avatar1234.jpg"
---- s1    0.5 EXPECT req.url (/image/user-pics/image/avatar1234.jpg) == "/image/user-pics/images/avatar1234.jpg" failed
...

We may have been overzealous here, and the second "/images/" also got converted into "/image/", not good. Let's take a step back and look at the code; after all, we are using regular expressions without having introduced them first:

  • ~: checks whether the pattern matches and, if it does, we enter the if statement. This prevents us from executing all the regsuballs in sequence; we only want to run them on the original req.url.
  • regsuball(STRING, PAT, REP): looks at STRING, and replaces all PAT occurrences with REP.
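
The "replace every occurrence" behaviour is easy to reproduce outside Varnish. The sketch below is only an approximation: it uses Python's re module rather than Varnish's regex engine, and the regsuball function is a hypothetical stand-in written for this illustration, not Varnish code. For a pattern this simple, though, the result should be the same as the one varnishtest reported:

```python
import re

def regsuball(string, pattern, replacement):
    # Hypothetical stand-in for the VCL function: re.sub replaces every
    # non-overlapping match when no count is given.
    return re.sub(pattern, replacement, string)

url = "/images/user-pics/images/avatar1234.jpg"
rewritten = regsuball(url, "/images/", "/image/")
print(rewritten)  # /image/user-pics/image/avatar1234.jpg
```

Both occurrences of "/images/" are rewritten, which is exactly what the failing expect complained about.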

So the error comes from regsuball, which replaces every match of the pattern. Maybe we would be better off with regsub? It does the same thing, but only on the first match. Asking varnishtest, we get:

gquintard@home:master:~/work/varnish-cache$ varnishtest foo.vtc
...
**   s1    0.5 === expect req.url == "/othersite/images/moon-landing.jpg"
---- s1    0.5 EXPECT req.url (/othersite/image/moon-landing.jpg) == "/othersite/images/moon-landing.jpg" failed
...

Okay, good news and bad news here. Good news is we passed the previous test, bad news is we failed the next one because /images/ is only interesting to us if it starts the location, and we didn't tell that to the VCL.

When using regular expressions, we can signify the beginning of a string with "^" and the end of it with "$", so this should work:

sub vcl_recv {
	# anchor the conditions too, so a rule only fires on URLs it can rewrite
	if (req.url ~ "^/cmsa/post/") {
		set req.url = regsub(req.url, "^/cmsa/post/", "/content/articles/");
	} else if (req.url ~ "^/images/") {
		set req.url = regsub(req.url, "^/images/", "/image/");
	}
}

gquintard@home:master:~/work/varnish-cache$ varnishtest foo.vtc
#     top  TEST foo.vtc passed (1.607)

And it does! That's a relief! Let's stop here, on a victory. The point is that even if the logic seems sane, tests will often reveal problems, so you really should write them.
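
If you want to replay the four cases without starting Varnish, a rough model of the anchored, first-match-only logic might look like this. It is a sketch under assumptions: RULES and rewrite are names made up for the illustration, and Python's re module stands in for Varnish's regex engine, which is close enough for patterns this simple but not guaranteed in general.

```python
import re

# (pattern, replacement) pairs, tried in order; illustrative names only.
RULES = [
    (r"^/cmsa/post/", "/content/articles/"),
    (r"^/images/", "/image/"),
]

def rewrite(url):
    # Like the if / else-if chain: the first rule that matches wins,
    # and count=1 replaces only the first match, as regsub does.
    for pattern, replacement in RULES:
        if re.search(pattern, url):
            return re.sub(pattern, replacement, url, count=1)
    return url

for url in [
    "/cmsa/post/my_post.html",
    "/images/cutter_otter.jpg",
    "/images/user-pics/images/avatar1234.jpg",
    "/othersite/images/moon-landing.jpg",
]:
    print(url, "->", rewrite(url))
```

This does not replace the VTC file, which exercises the real VCL, but it is a quick way to reason about a pattern before committing it.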

Doing more complicated stuff

Regular expressions are a powerful tool allowing you to describe and change text. For example, you can swap the first and second directories of a URL:

set req.url = regsub(req.url, "^/([^/]+)/([^/]+)/", "/\2/\1/");
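
To see what that substitution does on concrete URLs, here is the same pattern tried in Python. This is an approximation (Python's re module instead of Varnish, and swap_first_two_dirs is a made-up helper name), but capture groups and \1, \2 back-references behave the same way for this pattern:

```python
import re

def swap_first_two_dirs(url):
    # Two capture groups grab the first and second path segments;
    # the replacement emits them in reverse order.
    return re.sub(r"^/([^/]+)/([^/]+)/", r"/\2/\1/", url, count=1)

print(swap_first_two_dirs("/blog/2016/post.html"))  # /2016/blog/post.html
print(swap_first_two_dirs("/only-one/post.html"))   # /only-one/post.html
```

The second URL has only one directory, so the pattern does not match and the URL is left untouched.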

Underneath the shiny VCL coating, Varnish uses libpcre, a standard among regex implementations, meaning the regexes you write in VCL should be compatible pretty much everywhere (excluding character escapes). Notably, if you want to try out a regular expression, have a look at sites like regex101.com, which explain what's going on.

Scaling up

"What if I have thousands of rules?" you may be wondering. Well, first, you should not have thousands of rules because it's going to make your life harder, but stuff happens...

Having thousands of rewrite rules isn't really a problem, because VCL compilation requires regexes to be literals (known-in-advance strings) so that they can be compiled and optimized when the VCL is loaded. Not only that, but don't forget that it's all C behind the curtain, so it's super fast.

Going back to your hypothetical question, put all your rules inside one file:

"/cmsa/post/","/content/articles/"
"/images/","/image/"
"/foo/","bar"
"baz/([^/]+)/","/\1/"
...

Use a script to generate the VCL, for example, in python:

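import csv

# Read "pattern","replacement" rows and build one if-block per rule.
out = []
with open('foo') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # row[0] is the regex to match, row[1] its replacement
        out.append(
            'if (req.url ~ "{}") {{\n'
            '\t\tset req.url = regsuball(req.url, "{}", "{}");\n'
            '\t}}'.format(row[0], row[0], row[1]))

# Chain the blocks with "else" so that only the first matching rule runs.
print("# generated code, do not edit")
print("sub rewrite_url {\n\t", end="")
print(" else ".join(out), end="")
print("\n}")

Running it against the file above prints:

# generated code, do not edit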
sub rewrite_url {
    if (req.url ~ "/cmsa/post/") {
        set req.url = regsuball(req.url, "/cmsa/post/", "/content/articles/");
    } else if (req.url ~ "/images/") {
        set req.url = regsuball(req.url, "/images/", "/image/");
    } else if (req.url ~ "/foo/") {
        set req.url = regsuball(req.url, "/foo/", "bar");
    } else if (req.url ~ "baz/([^/]+)/") {
        set req.url = regsuball(req.url, "baz/([^/]+)/", "/\1/");
    }
    ...
}

Include and use it in your VCL, and voilà:

include "urls.vcl";

sub vcl_recv {
	call rewrite_url;
}

Oooooooooooor, if you are a Varnish Plus customer, you can use vmod-kvstore to map old URLs to new ones (with no regex mumbo-jumbo) and load the file directly from VCL, with no script to transform the data.

Your file would look the same, but with no quotes:

/url/1,/newurl/1
/url/2,/newurl/2

And from the VCL:

import kvstore;

sub vcl_init {
    kvstore.init_file(0, 25000, "/some/path/rewrites.csv", ",");
}

sub vcl_recv {
    set req.http.X-rewrite = kvstore.get(0, req.url, "");
    if (req.http.X-rewrite != "") {
        set req.url = req.http.X-rewrite;
    }
}

Super easy, right?
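
Conceptually, this approach swaps a chain of regex tests for a single exact-match lookup. The sketch below models only that idea in Python; it says nothing about how vmod-kvstore is implemented, and load_rewrites and rewrite are hypothetical names chosen for the illustration.

```python
import csv
import io

def load_rewrites(lines):
    # Each row is "old,new"; rows with any other shape are skipped.
    return {row[0]: row[1] for row in csv.reader(lines) if len(row) == 2}

def rewrite(url, table):
    # Exact match only: unknown URLs pass through unchanged.
    return table.get(url, url)

sample = io.StringIO("/url/1,/newurl/1\n/url/2,/newurl/2\n")
table = load_rewrites(sample)
print(rewrite("/url/1", table))  # /newurl/1
print(rewrite("/url/3", table))  # /url/3
```

The trade-off is visible here: a lookup needs one entry per URL, whereas a single regex rule can cover a whole directory.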

A word of wisdom

However, being easy doesn't make it right, and you should plan and aim to never need such extreme cases. Behind this, there is a very mathematical reason: the pool of URLs you have to rewrite can only grow, never shrink, and that can be a lot of rules to keep track of.

This is why the VCL should never be the primary source of your rules. It should be generated from a git repository or from a database: keep the rules in a neutral, replicated format so you can re-use and transform the data.
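
One benefit of a neutral format is that you can sanity-check the rules before any VCL is generated. The following is a minimal sketch of such a check, assuming the quoted "pattern","replacement" layout used earlier. check_rules is a hypothetical helper, and because Python's re module is not PCRE, a pattern that compiles here is not guaranteed to compile in Varnish, or the other way around.

```python
import csv
import io
import re

def check_rules(lines):
    """Return (line number, message) pairs for suspicious rows."""
    problems = []
    seen = set()
    for lineno, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 2:
            problems.append((lineno, "expected 2 columns"))
            continue
        pattern = row[0]
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((lineno, "invalid regex: {}".format(exc)))
        if pattern in seen:
            # With an if / else-if chain, a repeated pattern can never fire.
            problems.append((lineno, "duplicate pattern"))
        seen.add(pattern)
    return problems

sample = io.StringIO(
    '"/cmsa/post/","/content/articles/"\n'
    '"/images/","/image/"\n'
    '"/images/","/img/"\n'
    '"/broken/(","/x/"\n'
)
problems = check_rules(sample)
for lineno, message in problems:
    print(lineno, message)
```

On this sample it flags line 3 as a duplicate and line 4 as an invalid regex. It complements the varnishtest cases rather than replacing them.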

Also, remember when I said that regex are powerful, two sections ago? I meant it. They are so powerful that if you use them wrong and shoot yourself in the foot, you generally vaporize the floor, burn your whole leg, AND hurt your feelings. People telling you otherwise may be perl users, beware!

More seriously, regular expressions should be used carefully, and this is why we kicked off by using varnishtest. Regex should also only be used when they are the right tool for the job; sometimes people forget that and do crazy things, like this. In passing, note that this article is 7 years old, and the URL still works!

In VCL, you'll be tempted to use regex to manipulate querystrings and cookies. If it happens, restrain yourself! We have vmod-cookie in varnish-modules and vmod-querystring to deal with them in a safe and sane manner, and you should definitely use them.

And that's all for now, stay tuned for the next post!

[1]: not a typo, just a play on words and pop culture, a cyber-pun(k), if you will...

Image (c) 2012 astroshots42 used under Creative Commons license

Topics: varnishtest, URLs, rewriting URLs, regex

12/14/16 1:31 PM by Guillaume Quintard
