In this blog post we’re exploring some aspects of Varnish Enterprise that enable what it is best known for: excellent performance. When we talk about the performance of caching servers, what do we mean?
- Cache hit ratio - how many requests can be satisfied by the cache without visiting the backend
- Lookup time - how long it takes to locate the content in memory
- Response time - how quickly the cache starts delivering content to the user
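As a rough illustration of the first metric, the hit ratio is the share of requests answered from cache. The sketch below is a minimal example; the function name and the counter values are made up for illustration and do not come from a real deployment.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Return the fraction of requests served from cache (0.0 if no traffic)."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical counter values, for illustration only.
ratio = cache_hit_ratio(hits=9_500, misses=500)
print(f"hit ratio: {ratio:.1%}")  # prints "hit ratio: 95.0%"
```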
At Varnish Software we are obsessed with performance. R&D teams focus on making Varnish Enterprise able to handle as many requests as possible in cache, as quickly as possible, and for the client to receive requested content with the lowest latency. What follows are design decisions and features that contribute to Varnish Enterprise’s performance capabilities. Some of them were instrumental in reaching our 500 Gbps throughput benchmarks on standard hardware with Intel.
A highly optimized fast path
The fast path refers to the code path(s) where everything works to our advantage - the request is a cache hit and the object can be delivered from memory. Here context switches, locking and memory allocations / deallocations are minimized to achieve as much performance as possible. Of course, other parts of Varnish are optimized, too, but special care has been put into the code which is most important to performance.
Varnish Massive Storage Engine (MSE)
MSE is an advanced storage engine that handles cached objects both on disk and in memory. It allows volumes of data larger than memory capacity to be stored in cache, which increases cache hit ratios, while ensuring popular content is kept in memory. Because MSE decides which parts of the cached objects are in memory, memory usage is controlled explicitly from user space rather than by the kernel. Cache data is written to and read from disk using asynchronous direct I/O operations, bypassing the kernel’s virtual memory system. Having MSE accurately determine which content to keep in memory is much more efficient than leaving those decisions to the kernel, and it puts less pressure on disks by reducing I/O operations.
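The general idea of keeping popular objects in memory while letting the full cache exceed memory capacity can be sketched with a toy two-tier cache. This is not how MSE is implemented: the class name `TwoTierCache`, the least-recently-used policy, and the plain dictionary standing in for the disk are all assumptions made for the example.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy model: a small memory tier in front of a larger 'disk' tier."""

    def __init__(self, memory_slots: int):
        self.memory_slots = memory_slots
        self.memory = OrderedDict()  # hot objects, most recently used last
        self.disk = {}               # stand-in for the on-disk store

    def put(self, key, value):
        # New or refreshed objects start out in memory.
        self.disk.pop(key, None)
        self.memory[key] = value
        self.memory.move_to_end(key)
        self._demote_cold_objects()

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as recently used
            return self.memory[key]
        if key in self.disk:
            value = self.disk.pop(key)
            self.put(key, value)          # promote back into memory
            return value
        return None                       # cache miss

    def _demote_cold_objects(self):
        # Move the least recently used objects to disk instead of dropping them.
        while len(self.memory) > self.memory_slots:
            cold_key, cold_value = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_value


cache = TwoTierCache(memory_slots=2)
for name in ("a", "b", "c"):
    cache.put(name, f"object {name}")

# "a" was the least recently used object, so it was demoted to the disk tier.
promoted = cache.get("a")  # still a hit; "a" returns to memory, "b" is demoted
```

The point of the sketch is that an object leaving memory is still a cache hit later on, which is what raises the hit ratio beyond what memory alone could hold.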
Efficient memory allocation
In addition to minimizing memory allocations and deallocations, Varnish uses a memory allocator optimized for heavily threaded applications. This is important both when Varnish runs in memory-only mode and when physical disks are used to cache objects. In both cases, the Memory Governor feature self-regulates the cache size to keep memory usage constant, taking other memory requirements into account. This way, memory usage is optimized while adequate headroom is preserved.
Support for Non-Uniform Memory Access (NUMA) APIs
In a NUMA architecture, cores are grouped into multiple clusters, or nodes. Each node has its own local memory region, but cores in one node can still access all the memory in the system. However, there is a cost associated with accessing resources located on a different node than the one a thread is currently running on. Varnish is NUMA-aware, which means it can use NUMA APIs to reduce this cost compared to a NUMA-unaware program.
NUMA awareness is crucial for achieving maximum performance from servers with more than one CPU. The Varnish Enterprise architecture, with thread pools and NUMA-local memory pools, works well in a NUMA environment and keeps inter-NUMA traffic low. It allows Varnish to make better decisions about which resources in the system to use for a particular transaction, improving the locality of accesses to memory as well as to I/O devices.
Varnish Enterprise sees excellent performance scaling from single to dual processor systems thanks to this NUMA awareness, so services can be upgraded without software constraining any potential performance improvements.
In modern CPU architectures, both single-processor and dual-processor systems can actually utilize NUMA. In our testing with Intel, the single processor server used Sub-NUMA Clustering to split the single CPU into two NUMA regions.
Varnish Configuration Language (VCL)
VCL runs inside Varnish during the request-handling process, and lets you define advanced logic to extend default Varnish behavior. VCL is very fast because it is translated to C and then compiled to machine code, so it runs at native speed rather than being interpreted at runtime. This creates significant performance gains for setups involving edge logic.
Request coalescing
This feature identifies requests for the same uncached resource, queues them on a waiting list and sends only a single request to the origin. As the origin responds, Varnish satisfies the entire waiting list in parallel, so there's no head-of-line blocking: everyone gets the content at the same time. Request coalescing effectively collapses multiple potential requests to the origin into a single request, minimizing the amount of work done, even on cache misses. The performance advantages are straightforward: less pressure on the origin server and less latency for queued clients. Request coalescing is especially useful for caches with limited storage that must evict objects to free up space, and for workloads such as live streaming, where new segments are constantly added to the stream.
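The waiting-list behavior described above can be sketched in a few lines of Python. This is a simplified model and not Varnish's implementation: `CoalescingCache`, `slow_origin`, and the segment URL are hypothetical names, threads stand in for client requests, and error handling for a failed origin fetch is left out.

```python
import threading
import time


class CoalescingCache:
    """Toy cache that sends one origin request per uncached key."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._objects = {}    # key -> cached body
        self._in_flight = {}  # key -> Event set once the fetch completes

    def get(self, key):
        with self._lock:
            if key in self._objects:
                return self._objects[key]       # cache hit
            done = self._in_flight.get(key)
            leader = done is None
            if leader:
                # First request for this key: it will go to the origin.
                done = threading.Event()
                self._in_flight[key] = done
        if leader:
            body = self._fetch(key)             # the single origin request
            with self._lock:
                self._objects[key] = body
                del self._in_flight[key]
            done.set()                          # wake the whole waiting list
            return body
        done.wait()                             # queued behind the leader
        with self._lock:
            return self._objects[key]


origin_calls = []


def slow_origin(key):
    # Hypothetical origin: records the call and takes 100 ms to answer.
    origin_calls.append(key)
    time.sleep(0.1)
    return f"body of {key}"


cache = CoalescingCache(slow_origin)
results = []


def client():
    results.append(cache.get("/live/segment-42.ts"))


threads = [threading.Thread(target=client) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"clients served: {len(results)}, origin requests: {len(origin_calls)}")
```

Ten concurrent clients ask for the same uncached segment, but the origin only sees one request, and every waiting client is released by the same event rather than one after another.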
TLS implementation
Varnish offers Hitch, which is a scalable, high-performance TLS proxy designed for terminating TLS at scale, when it is useful to run TLS as a separate process. It is safe for large installations, with up to 15,000 listening sockets and 500,000 certificates. It supports seamless run-time configuration reloads of certificates and endpoints, as well as OCSP stapling, TCP fast-open, mTLS and Unix Domain Sockets (UDS). However, there is a performance cost associated with an extra layer of software, even when it communicates with Varnish through UDS on the same physical machine.
For enterprises like streaming services and network operators who want to push performance, Varnish Enterprise also offers in-process TLS. Enabling TLS natively within Varnish eliminates the need for the extra layer, reducing operational complexity while increasing HTTPS performance and minimizing latency. Built-in TLS was an important factor in hitting 500 Gbps throughput.
This is just a selection of the architectural factors and features that contribute towards Varnish Enterprise’s leading performance. There are others, including high availability and content prefetching, that also help Varnish deliver content quickly to users at scale.
For now, to read more about how the above capabilities work in practice, see our joint white paper with Intel: Delivering up to 500 Gbps Throughput for Next-Gen CDNs 👇