Amazon S3 was one of the original web services. Launched in 2006, the service delivered on its promise of being a “simple storage service” accessible over the Internet. But outside of Amazon, very few people would have predicted the service’s success.
Fast forward 18 years and not only has the service grown, now storing trillions of objects and exabytes of data while serving billions of calls daily, but the way applications interface with it has become the de facto cloud storage protocol. So today, Amazon S3 is a service, and the S3 protocol is an interface that every major cloud storage service (Azure, OCI, GCP, Wasabi, Backblaze…), storage solution (Dell/EMC, HP, NetApp, VAST, Pure…), and many SaaS applications use as the primary way to store and access data. This was all driven by three very basic value propositions:
- Infinite scalability: there’s room for your data, always
- Instant accessibility over HTTP
- Pay-as-you-grow pricing
That’s a pretty good deal. Alas, life is made of compromises, and the compromise here is locality, which brings a couple of important caveats: performance and cost. It’s a good thing Varnish exists then, because with it we can build an S3 shield: a caching layer that preserves the benefits of cloud storage while solving that pesky locality problem.
Egress costs, ingress costs, API calls and not paying for them
S3 clusters are usually big in terms of machine footprint, which makes sense: we want them to store everything we throw at them. The downside is that those clusters generally exist in a handful of locations and aren’t moved easily. If you maintain your own cloud storage service, you need to decide where your machines run; if you run on AWS or GCP, you need to pick the region where the data will reside (or multiple regions, for a fee).
That is going to be important if you need to use your data in a different location than where it is stored. For example:
- If your EC2 cluster in region X downloads data from an S3 bucket in region Y, you’ll pay inter-region transfer fees
- If you need your S3 assets to be used by an on-prem cluster, your hyperscaler will probably charge you egress fees
- Multi-cloud users storing data in GCP but consuming it from EC2 will pay egress fees on the GCP side, and possibly additional network charges on the AWS side
This is of course true of any platform that charges network fees, and to be fair, those fees are generally very reasonable and justified. But as your usage grows, so do they, maybe to the point that you need to do something about it.
The solution is devilishly simple: put a Varnish cache node (or many!) where you are going to use your data. Transfer costs drop massively because most of the data will be served from the local Varnish, on the “free” part of the network; only cache misses need to travel back to the origin.
And since we are talking about cost, let’s mention that storage-as-a-service offerings will generally charge you per API call. Again, this can be very modest or substantial, depending on your usage. It’s a pretty good example of a “one size fits all, for a fee” policy. Here too, the Varnish strategy pays off: by reducing the number of calls that actually reach the service, it’s beneficial even for local setups where transfer fees aren’t a concern.
Distance, latency and performance concerns
Even if money isn’t an issue, locality, or the lack thereof, presents another challenge: your data needs to travel from where it’s stored to the user. The speed of light and optical fiber have their limits, so the farther you are from the storage, the more latency you incur, to the point where you might spend more time waiting for the first byte than downloading the full object.
On top of this, there’s always the problem of shared network resources: the farther away you are from your source, the more likely it is that some pipe in the middle is already congested, slowing down the whole transfer.
And on top of this, as said in the introduction, cloud storage is built for scalability and resilience; speed, bandwidth, and latency are secondary. It’s a solid stance: being fast is useless if you’ve lost the data. Still, performance may be lacking if your application is a hungry data consumer (data sets for AI training, anyone?).
As you can imagine, since the problem is so similar to the previous one, the solution is the same: put Varnish close to the users. The benefit is that we’re now saving time on top of money.
We still have to suffer the performance impact on the first download of a file, but once it’s in cache, it’s served locally, at blazing speed.
Starting easy and very, very fast
The S3 shield is built on top of Varnish Enterprise and will reduce costs and improve performance the instant it’s deployed. Here’s the minimal configuration we need to get started:
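In spirit, it boils down to something like this (a sketch with placeholder values: the bucket hostname and TTL are examples, the .ssl attribute assumes Varnish Enterprise’s native backend TLS, and request signing is handled by the full configuration linked below):

```vcl
vcl 4.1;

# Placeholder bucket endpoint; replace with your own.
backend s3_origin {
    .host = "my-bucket.s3.eu-west-1.amazonaws.com";
    .port = "443";
    .ssl = 1;   # backend TLS, a Varnish Enterprise backend attribute
    .host_header = "my-bucket.s3.eu-west-1.amazonaws.com";
}

sub vcl_backend_response {
    # Cache objects for a day; tune this to how often your data changes.
    set beresp.ttl = 24h;
}
```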
In short, tell Varnish where your storage is, how long the data should be cached, and the credentials to authorize the requests (if needed), and it’ll handle the rest.
Additionally, Varnish will happily validate JWT tokens, rate-limit traffic, compress data on the fly, provide clustering, and deliver the usual exhaustive logs and metrics we know and love.
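To give an idea, rate limiting and on-the-fly compression each take only a few lines of VCL. The snippet below is just an illustration with arbitrary thresholds, using vmod_vsthrottle (shipped with Varnish) for the throttling:

```vcl
vcl 4.1;

import vsthrottle;

sub vcl_recv {
    # Arbitrary example limit: 100 requests per 10 seconds per client IP.
    if (vsthrottle.is_denied(client.identity, 100, 10s)) {
        return (synth(429, "Too Many Requests"));
    }
}

sub vcl_backend_response {
    # Compress text-like payloads on the fly before they enter the cache.
    if (beresp.http.Content-Type ~ "^(text/|application/json)") {
        set beresp.do_gzip = true;
    }
}
```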
Trying it out
If all this sounds good to you, maybe you’d like to see it in action, caching your own S3 bucket? If so, you can jump in right now! The code is publicly available right here.
On top of the actual configuration, you can find the full documentation on the developer portal, which walks you through the configuration itself but also covers deployment via various methods, like Docker Compose, Helm, or Terraform.