Feeding GPUs at Scale: What AI Infrastructure Teams Can Learn from Tiered Caching Architectures

Executive summary

Varnish is a high-throughput, multi-tier caching solution that can eliminate object storage bottlenecks and triple GPU utilization for enterprise AI infrastructure teams managing large-scale training clusters.

For massive AI workloads, a single cache layer isn't always enough. You need to design a data path where each layer has a distinct job:

Capacity close to storage

Request collapsing in the middle

Performance close to compute

Realizing tiered storage with Varnish

Much of the cost of a large AI training cluster goes to GPUs sitting idle. One customer of ours was running their training cluster at about 25% GPU utilization. The GPUs were fine, the scheduler was fine, and the model code was fine; the cluster was just waiting on bytes from the object store.

After a tiered cache went in, utilization climbed to roughly 75%. Tripling the effective output of an expensive cluster is unusual, and the architecture that got us there is worth explaining.

[Figure: Tier Storage Architecture]

The basic problem is that S3 was never built to feed GPUs. A single TCP connection to a hyperscaler bucket caps somewhere around 2 to 3 Gb/s (gigabits). First-byte latency is around 120 ms. Egress lists at roughly $50K per petabyte, and a training run reads the same dataset many times across epochs. None of those numbers move in your favor, and the workload above them keeps getting hungrier. A modern AI training node wants 1.5 to 3 GB/s (gigabytes) of dataset throughput, sustained. A thousand-node cluster therefore wants 12 to 24 Tb/s in aggregate: terabit-class storage. You cannot get there by adding TCP connections.
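To make the arithmetic concrete, here is a rough back-of-envelope sketch using the figures from the paragraph above. The function names, the 1 PB dataset, and the 10-epoch run are illustrative assumptions, not numbers from the customer deployment.

```python
import math

BITS_PER_BYTE = 8


def cluster_demand_tbits_s(nodes, per_node_gbytes_s):
    """Aggregate dataset throughput the cluster wants, in Tb/s."""
    return nodes * per_node_gbytes_s * BITS_PER_BYTE / 1000


def connections_needed(nodes, per_node_gbytes_s, per_conn_gbits_s):
    """Parallel TCP connections needed if every byte came straight from the bucket."""
    demand_gbits_s = nodes * per_node_gbytes_s * BITS_PER_BYTE
    return math.ceil(demand_gbits_s / per_conn_gbits_s)


def egress_cost_usd(dataset_pb, epochs, usd_per_pb=50_000):
    """List-price egress if every epoch re-reads the dataset from the bucket."""
    return dataset_pb * epochs * usd_per_pb


# 1,000 nodes at 1.5 to 3 GB/s each, against 2 to 3 Gb/s per connection.
low_demand = cluster_demand_tbits_s(1000, 1.5)
high_demand = cluster_demand_tbits_s(1000, 3)
best_case_conns = connections_needed(1000, 1.5, 3)
worst_case_conns = connections_needed(1000, 3, 2)

# Hypothetical run: a 1 PB dataset read for 10 epochs with no cache.
uncached_run_cost = egress_cost_usd(dataset_pb=1, epochs=10)
```

Under these assumptions the cluster would need thousands of sustained, perfectly balanced connections, each still paying the first-byte latency, and the egress bill would grow with every epoch.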

What you can do is put a cache in front of the object store. What we did, and this is the part I want to walk through, is put three.

Three tiers, three problems

The diagram for this resembles a gradient: capacity on the left, performance on the right. The leftmost cache is the Cache of Last Resort, the COLR. It sits as close to the bucket as possible, same region, sometimes the same rack as the egress router. It is capacity-biased, utilizing cost-effective storage such as HDD or QLC NVMe. The COLR serves as the final point before egress charges are incurred; once data is cached there, the rest of the system avoids reaching out to the object store for that content. The customer above runs about 250 PB of cache against an exabyte-scale bucket (a 10–20% ratio), achieving a 99.8% hit rate that reduces egress expenses by two orders of magnitude.
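The relationship between hit rate and egress is simple enough to sketch. This assumes the 99.8% figure is a byte hit rate (the post does not say whether it is measured per request or per byte), and the 1,000 PB of reads is a made-up volume for illustration.

```python
def egress_reduction_factor(hit_rate):
    """How many times less data leaves the bucket at a given byte hit rate."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


def origin_petabytes(read_pb, hit_rate):
    """Petabytes that still have to be fetched from the object store."""
    return read_pb * (1.0 - hit_rate)


# At a 99.8% hit rate only 0.2% of reads reach the bucket, a factor of about 500.
factor = egress_reduction_factor(0.998)

# Hypothetical: 1,000 PB of reads across many epochs leaves about 2 PB of egress.
billed_pb = origin_petabytes(1000, 0.998)
```

A factor of about 500 sits between two and three orders of magnitude, which is consistent with the claim above.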

In the middle is the midtier, one per datacenter or per InfiniBand fabric. The midtier is balanced: enough RAM to keep medium-warm content resident, and enough NVMe to hold a substantial chunk of the dataset. It exists for a reason that isn't obvious until you skip it. Without a midtier, each edge node makes independent requests upstream, which fragments your hit rate and amplifies tail latency every time a training job restarts or a GPU server gets reimaged. The midtier collapses N edges into one connection to the COLR. For datacenters spread geographically, the midtier also removes much of the locality and data gravity issue by keeping relevant data in the same datacenter as the GPUs that need to read it.
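Request collapsing is easier to see in code than in prose. The following is a minimal sketch of the idea, not Varnish's implementation: `CollapsingCache`, `demo`, and the shard name are hypothetical, and a real cache would also handle TTLs, streaming, and eviction.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class CollapsingCache:
    """Concurrent misses for the same key share a single upstream fetch."""

    def __init__(self, fetch_upstream):
        self._fetch = fetch_upstream
        self._lock = threading.Lock()
        self._store = {}     # key -> cached bytes
        self._inflight = {}  # key -> Future for a fetch already in progress

    def get(self, key):
        with self._lock:
            if key in self._store:
                return self._store[key]
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                # First requester for this key: it will do the upstream fetch.
                fut = Future()
                self._inflight[key] = fut
        if not leader:
            # Everyone else waits on the leader's result.
            return fut.result()
        try:
            value = self._fetch(key)
        except Exception as exc:
            with self._lock:
                del self._inflight[key]
            fut.set_exception(exc)
            raise
        with self._lock:
            self._store[key] = value
            del self._inflight[key]
        fut.set_result(value)
        return value


def demo(edges=16):
    """Simulate `edges` edge nodes asking for the same shard at once."""
    upstream_calls = []

    def slow_fetch(key):
        time.sleep(0.05)  # stand-in for a slow upstream round trip
        upstream_calls.append(key)
        return b"shard-bytes"

    cache = CollapsingCache(slow_fetch)
    with ThreadPoolExecutor(max_workers=edges) as pool:
        results = list(pool.map(cache.get, ["shard-0001"] * edges))
    return upstream_calls, results
```

In the demo, sixteen simulated edges ask for the same shard and the upstream sees one request. That is the effect the midtier has on the COLR when a training job restarts and every edge goes cold at once.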

On the right is the edge tier, the thing the consumers actually talk to. Edge nodes are performance-first. Storage is DRAM or 5th-gen NVMe. NICs are 400 or 800 Gb/s. The design target is around 1 Tbps per node for in-cache content, and that target is what makes the cluster-wide throughput numbers work out. Edge tier eviction is intentionally aggressive: cold objects fall back to the midtier rather than wasting RAM. The edge holds the shards in the current epoch, the manifests, the metadata, and the bytes being read right now.
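The way the three tiers chain together can be sketched with a toy LRU cache per tier. This is only an illustration under stated assumptions: `LRUTier`, the capacities (counted in objects rather than bytes), and the read pattern are hypothetical, and plain LRU stands in for whatever eviction policy a production cache uses.

```python
from collections import OrderedDict


class LRUTier:
    """A toy cache tier: on a miss, ask the upstream callable and keep the result."""

    def __init__(self, name, capacity, upstream):
        self.name = name
        self.capacity = capacity  # number of objects, for simplicity
        self.upstream = upstream  # callable: key -> bytes
        self.objects = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.objects:
            self.objects.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.objects[key]
        self.misses += 1
        value = self.upstream(key)
        self.objects[key] = value
        if len(self.objects) > self.capacity:
            self.objects.popitem(last=False)  # evict least recently used
        return value


def demo():
    origin_reads = []

    def origin(key):
        origin_reads.append(key)  # every call here would be billed egress
        return b"x" * 8

    colr = LRUTier("colr", capacity=1000, upstream=origin)
    mid = LRUTier("mid", capacity=100, upstream=colr.get)
    edge = LRUTier("edge", capacity=4, upstream=mid.get)  # deliberately tiny

    for i in range(8):        # first pass over eight shards
        edge.get(f"shard-{i}")
    for i in (7, 6, 0, 1):    # two still-hot shards, then two evicted ones
        edge.get(f"shard-{i}")
    return edge, mid, colr, origin_reads
```

In this run the edge evicts the early shards almost immediately, yet re-reading them costs only a midtier hit. The origin is touched once per shard no matter how aggressively the edge forgets.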

The fabric boundary

The interesting engineering choice in this customer's deployment is where the InfiniBand fabric stops. InfiniBand switch ports are precious. Not figuratively. They are a real budget line, and the cluster operator who has spent millions on a fabric is not going to spend more of it on storage nodes than is strictly necessary.

So the edge caches sit with one leg in the fabric and one leg outside it. The midtier sits entirely outside the fabric, on plain Ethernet. From the GPU's point of view, the storage path is InfiniBand to the edge, Ethernet from the edge to the mid, IP to the COLR, S3 to the bucket. The bandwidth gradient and the network gradient are aligned: the closer you get to compute, the more expensive every meter of cable becomes, and the more aggressively the cache works to keep you from having to pull a byte across it.

This is more than an architectural decision. It is also a budget decision, because the alternative is putting fat mid-cache boxes on InfiniBand ports that should be hosting GPUs.

The cache is lossy on purpose

This is the part I most wanted to write about. Varnish Enterprise, the cache software running in all three tiers, is built to be lossy. There is no RAID. If a disk in a cache node fails, that node loses the content on that disk, and the next request for any of it gets refetched from the next tier upstream. The system is engineered around the assumption that the upstream is the source of truth and the cache is allowed to forget.

This sounds reckless until you sit with it. Lossy caching means that when we write a new object to disk and there's an old object in the way, we don't fragment the write to fit around the old one. We discard the old object. The write goes down as one contiguous extent. That's faster to write. It is also faster to read. The next time someone asks for that new object, we hand them back one clean range instead of a scatter-gather of fragments. Sequential I/O on the way in, sequential I/O on the way out.
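To illustrate the "discard whatever is in the way" idea, here is a toy in-memory store that always writes one contiguous extent and drops any older object the write overlaps. It is a sketch of the principle only and makes no claim about how Varnish Enterprise lays out data on disk; `LossyRingStore` and its wrap-around policy are assumptions made for the example.

```python
class LossyRingStore:
    """Toy store: every object is one contiguous extent; overlapped objects are forgotten."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0    # offset where the next write lands
        self.index = {}  # key -> (offset, length)

    def put(self, key, data):
        n = len(data)
        if n > self.size:
            raise ValueError("object larger than the store")
        if self.head + n > self.size:
            self.head = 0  # wrap instead of splitting the object across the end
        start, end = self.head, self.head + n
        self.index.pop(key, None)  # an older copy of this key is now stale
        # Forget every object whose extent overlaps [start, end).
        for other, (off, length) in list(self.index.items()):
            if off < end and start < off + length:
                del self.index[other]
        self.buf[start:end] = data  # one contiguous write
        self.index[key] = (start, n)
        self.head = end

    def get(self, key):
        loc = self.index.get(key)
        if loc is None:
            return None  # a miss: the caller refetches from upstream
        off, length = loc
        return bytes(self.buf[off:off + length])  # one contiguous read


store = LossyRingStore(size=10)
store.put("a", b"aaaa")  # occupies offsets 0..3
store.put("b", b"bbbb")  # occupies offsets 4..7
store.put("c", b"cccc")  # does not fit at offset 8, wraps and overwrites "a"
```

After the third write, `get("a")` is a miss and `get("c")` is a single range read. There is no free-space map to maintain and no fragment list to walk, which is the work that a cache allowed to forget gets to skip.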

Can we stop and appreciate how much engineering this design choice quietly avoids? RAID rebuilds at petabyte scale are a logistical nightmare, and fragmentation under heavy churn slows every subsequent read for the lifetime of the file. By embracing 'lossy' caching (accepting that the cache is allowed to forget while the upstream remains the source of truth) we eliminate those overheads entirely.

The result

The same customer is now pushing about 1.2 Tbps through their edge tier while churning roughly 75,000 objects per second. The cache is constantly rotating: training shards come in, get read by half the cluster a few times across an epoch, and get evicted to make room for the next ones. The hit rate against the midtier is high enough that the COLR is mostly idle, and the COLR's hit rate against the bucket is high enough that the egress invoice has become a rounding error.

I'll be honest: when we first sized this, I expected the edge tier to struggle under that kind of object turnover. Cache eviction at 75K/s is brutal. However, it does not struggle. This almost feels like magic, but it isn't. The storage path was designed around the assumption that the cache will forget sometimes, so it gets to skip all the work that "always remembering" would have cost.

Where this is overkill

If you take one thing from this post, take this: do not build three tiers of cache because it looks impressive on a diagram. This architecture is useful when the workload has multiple datacenters, thousands of nodes per DC, an exabyte-class bucket, and a GPU bill that makes 25% utilization an unacceptable line item. If you don’t tick several of these boxes, you are probably paying for complexity you do not need.

The structure does scale down, though. A single rack of inference servers with a dataset that fits in node RAM doesn't need a COLR or a midtier. It just needs an edge tier in front of S3 to deal with the per-connection cap and the first-byte latency. The tiering idea is fractal. The capacity-to-performance gradient still applies even when you only have one tier; the tier just has to be on the right side of that gradient for the workload.

Three tiers is what falls out when the workload is large enough that no single tier can cover both ends of the gradient. The customer above is at the far right of that spectrum. Most workloads aren't, and that is fine.

Want to learn more?

I'm at per.buer @ varnish-software.com if you're interested in discussing this topic further. Alternatively you can reach out to Varnish in general.
