Varnish Cache has a history of relying on the operating system kernel for its performance. I’ve been keeping an eye on two very interesting components in the Linux kernel that were built exactly for this scenario: bcache and dm-cache. Both of these components use solid state drives to cache hard disks. That gives us the possibility of adding a secondary caching layer to our setup. So, we add one terabyte of flash storage to our 20 terabyte server, and suddenly we have quite a bit of cache.
Adding flash storage for caching gives us two distinct advantages. One is that the SSD can be used for read caching. It should be able to provide a reasonable hit rate if there are any patterns in the traffic.
The second advantage is that it can provide a writeback cache for the HDDs. Sudden bursts of traffic will create significant IO load. If the writeback cache works as it should, the caching layer will push the writes to the SSD and let them trickle down onto the HDD when the HDD has IO capacity to spare.
Setting up bcache
When testing I used our lab servers. These are Haswell servers with E5 Xeon CPUs, SSDs and HDDs, running Debian Jessie (testing). bcache was a bit of a bother to set up, as nobody has packaged the user space tools yet. There is a PPA, however, so getting packages installed is simple. You might need to rebuild the package, depending on your platform.
Once installed, you need two partitions: the HDD partition you want to cache and an SSD partition. Then you initialize the HDD partition (the backing device):
make-bcache -B /dev/sdb1
Then initialize the cache:
make-bcache -C /dev/sda5
and make a note of the cache set UUID it prints. Then register both devices with the kernel (this is normally done automatically by udev):
echo /dev/sdb1 > /sys/fs/bcache/register
echo /dev/sda5 > /sys/fs/bcache/register
The backing device now shows up as /dev/bcache0. Attach the cache set to it by echoing the cache set UUID into its attach file:
echo dcf6b1bc-6667-11e4-b251-08002758f0fe > /sys/block/bcache0/bcache/attach
You probably want to enable writeback caching:
echo writeback > /sys/block/bcache0/bcache/cache_mode
Now you’re almost done.
mkfs.ext4 /dev/bcache0
and you’re done.
The setup is persisted through some udev magic. You can mount the filesystem and start using it.
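If you want to sanity-check the setup, the bcache sysfs interface shows whether a cache is attached and which cache mode is active:
cat /sys/block/bcache0/bcache/state        # "clean" or "dirty" means a cache is attached
cat /sys/block/bcache0/bcache/cache_mode   # the active mode is shown in brackets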
Setting up dm-cache
The other implementation I’ve looked at is dm-cache. It is basically the same idea as bcache, but it happens in the device mapper layer of the Linux kernel. Setting up the device mapper by hand is somewhat of a bother, and since recent versions of LVM support dm-cache, I opted to use LVM to set it up. Create two physical volumes with pvcreate, one on the HDD and one on the SSD.
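Assuming the same partitions as in the bcache setup above, that would be something like:
pvcreate /dev/sdb1   # the HDD partition
pvcreate /dev/sda5   # the SSD partition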
Then create a volume group (VG) with both of them:
vgcreate data /dev/sdb1 /dev/sda5
Next, create three logical volumes: the backing device, which we place on the HDD, and two on the SSD for the cache and its metadata. Here I’ve created a 1 terabyte LV for the backing device, a 100M cache and a 10M store for metadata (the lvcreate commands follow the sizing notes below).
Calculating the size of the metadata LV is based on the following:
You pick a chunk size for the cache. It should be something close to your average object size in Varnish. If you make it too big, you’ll waste a lot of cache space; if you make it too small, the metadata overhead will increase. Further research on this is probably needed, especially if your workload has objects of many different sizes.
Once you’ve figured out your chunk size (I used 64K), divide the size of your cache LV by it to get the number of chunks. Each chunk takes 64 bytes of metadata in the metadata store. With 100M of cache and a 64K chunk size that gives:
(100M / 64K) * 64 bytes = 102400 bytes, or roughly 100K. That is a tiny fraction of the cache volume, but since LVM wants a minimum size for the metadata LV anyway, I rounded it up to 10M.
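If you want to let the shell double-check that arithmetic (everything in bytes):
echo $(( (100 * 1024 * 1024) / (64 * 1024) * 64 ))   # prints 102400
With the sizes settled, create the three logical volumes: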
lvcreate -L100M -n cache data /dev/sda5
lvcreate -L10M -n meta data /dev/sda5
lvcreate -L1T -n data data /dev/sdb1
Now we create a cache pool out of the two cache volumes:
lvconvert --type cache-pool --cachemode writeback --chunksize 64k --poolmetadata data/meta data/cache
and then attach it to our backing device:
lvconvert --type cache --cachepool data/cache data/data
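If you want to verify that the cache pool is attached, lvs will show it, and the device mapper can report hit and miss counters for the cache target (the dm name below assumes the volume group and LV names used above):
lvs -a data                 # lists the LVs, including the hidden cache pool volumes
dmsetup status data-data    # the cache target status line includes read/write hits and misses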
Then we mkfs, mount and fire up Varnish.
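As a rough sketch, assuming the VG/LV names from above and a made-up mount point, VCL path and storage size, that last step could look like:
mkfs.ext4 /dev/data/data
mkdir -p /srv/varnish
mount /dev/data/data /srv/varnish
varnishd -f /etc/varnish/default.vcl -s file,/srv/varnish/varnish_storage.bin,900G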
Your hard disk now has an SSD cache.
Troubles with bcache
When testing I had problems getting bcache to work properly. The first time it worked OK, but when I tried to set it up again I triggered a kernel BUG which prevented me from getting it back up again. Just loading the module triggered this, and I didn’t have the time to get to the bottom of it. Since Debian doesn’t ship the bcache user space tools, there are probably some interactions between the various components that make it unstable. At this point I would advise against running bcache on Debian, at least. You might have better luck on other platforms.
Benchmarking
When developing the new storage engine we've also made a standardized test which we've been using to keep an eye on performance over the last couple of iterations. The object sizes follow an exponential distribution and the access frequencies are Zipfian, with an average object size of 80KB. We're using Siege to generate the requests and some R scripts to collect and visualize the data.
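For reference, a Siege run against a prepared URL list looks something like this (the concurrency, duration and file name here are placeholders, not the exact parameters we used):
siege -b -c 200 -t 30M -f urls.txt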
Initial test results
The first tests I did were with the whole 13GB partition on the SSD. The file Varnish used for storage was only 2GB, so this is pretty far away from a real world scenario. However, the graph is still somewhat interesting. I ran three tests: one on the HDD directly, one with dm-cache and one with bcache.
Initially, the kernel will buffer writes. Soon it runs out of memory and Varnish becomes limited by the performance of the IO layer. I'm sort of stumped by the weird drop and rise that bcache exhibits. You can see the same thing on the HDD, but on a much smaller scale.
Complicating the picture - adding XFS to the mix
At this point I was unable to set up bcache again, so I decided to stick with dm-cache. dm-cache is also supported, as a tech preview, in RHEL 7, so I'm guessing it will be the preferred choice for many users.
So far I've been using ext4 exclusively. ext4 is very, very robust, and I've yet to lose data due to ext4 itself or any of its parents (ext2 and ext3). However, we don't worry that much about data loss here, since we're caching, so we'll prioritize performance. I've been hearing great things about how well XFS performs, so I wanted to give it a spin.
This time around I used much smaller sizes. The server was limited to 1 gigabyte of memory. Varnish was still using a 2GB file and the cache size was only 100M, roughly 5% of the data set.
As you can see, there are significant performance differences initially. XFS seems to perform a lot better overall. From what I know about XFS, it is a lot more aggressive when it comes to buffering, which allows it to coalesce a lot more writes, thereby increasing performance. As time passes performance seems to even out and there isn't much difference between the accelerated hard drive and the uncached one.
As this server now has one gigabyte of memory and only 100M of flash cache, this makes sense. The read cache that the flash layer provides probably has a hit rate very close to zero. It is clear that this experiment should be redone with a 10x bigger dataset. Having 20GB of HDD, maybe 5GB of flash cache and 1GB of memory would give us something closer to a real world workload. I hope to be able to regenerate our data set and perform new tests.
Final thoughts
dm-cache seems like a really good way of accelerating your hard drive when using it together with Varnish. With components such as dm-cache there is a potential to lose some data, but I wouldn't worry about that. Even when running without redundant IO, the worst thing that can happen when the SSD fails is that that particular Varnish server goes down and gets taken out of service. So, if you have a large dataset that you are accessing on HDDs, I would advise you to look into dm-cache, especially if you are using a new storage backend (to be announced later this month).
Download the Varnish Book to learn more tricks for managing your Varnish Cache installation optimally.