June 13, 2013
3 min read time

Advanced cache invalidation strategies

“There are only two hard things in Computer Science: cache invalidation and naming things.” -- Phil Karlton

Today, I’ll write about the hardest one of these two; cache invalidation. This used to be rather straightforward. Most websites had a rather simple structure between their content repository and their URL structure. This made it rather simple to purge URLs when the content changed.

Today the situation for many sites is a bit more complex. It is not uncommon to have an article being presented in six or ten different variants, under multiple URLs. Some sites operate with separate URL spaces for desktop, mobile and tablet. In addition an article might be partially referenced on other pages. Front- or section pages might refer to a whole lot of articles and the article itself might have content from other articles embedded into it. Keeping track of all these relations can be a pain.

One possible solution for this has been to use the ban function in Varnish. You do this by having the content presentation system list the article IDs on the object in a header, like this:

X-Article-IDs: 39474, 39232,38223,32958

Then you can issue a ban matching on the article ID and it will invalidate every object that references the matching IDs.

ban obj.http.x-article-id ~ “[ ,]$ID\D”

Very nice and very powerful. However, there is a flip side to the bans in Varnish. They are quite expensive. Every ban you issue will be matched against every object in the cache that is older than the ban itself. This will either happen at the time of delivery or will be done by a background worker thread commonly referred to as the ban lurker. So, in terms of scalability we're looking at N x M, where N is the number of objects in memory and M is the number of bans.

This is not a problem if you only issue a ban from time to time, but if you are dealing with a lot of them and you have many objects in memory they will require a lot of CPU time. In addition there are locks that will be subject to contention. For instance, it might take several seconds on a busy site to get the lock necessary to issue the ban.

Enter hashtwo (also known as hashninja)

Hashtwo was developed earlier this year to solve the scalability issues we’ve seen with bans. Hashtwo maintains an additional hash, in addition to the main hash that links objects and content in Varnish. You can put whatever you want into the hash and it will maintain a many-to-many relationship between keys and objects.

So, in the example above the content presentation layer could spit out the same list of articles in a header and hashtwo will maintain pointers from article numbers to cached object. The hash is pretty efficient and won't require much overhead. 

Then you can easily create a HTTP method similar to what you use for PURGE that would take an article ID as input and kill all the objects that reference that article.

The hashtwo module is only available to our Gold and Platinum subscribers. If you have a subscription and you want to give it a spin, send an email to our support and we'll help you get started.

Oh, and if you have an opionion on the name, we'd like to hear your view here in the comments. We're kind of split between hashtwo, hashninja, gatlingpurge and order66. We don't mind a little bikeshedding from time to time. :-)

Photo is (c) 2009 Jeyhun Pashayev and used under a CC license.