thread_pool_add_delay
REALLY adds delay
This parameter inserts a pause between thread creations: with it set, Varnish will only add one new thread per pool per delay period, which throttles how fast the thread pools can grow. It's a remnant of old tuning guides: in Varnish 3 the value was expressed in milliseconds, but since Varnish 4 it's in seconds, so a copy-pasted setting quietly became a thousand times larger. Check whether your command line still carries it:
ps aux | grep varnishd | grep thread_pool_add_delay
If you find something like
-p thread_pool_add_delay=1
remove it (or set it to 0): waiting a full second between thread creations means a freshly restarted Varnish takes minutes to reach its minimum thread count, queuing connections in the meantime.
default_ttl
is more of a fallback than a default
Vocabulary: TTL stands for "Time To Live" and is the period during which an object should be served from cache.
I've encountered quite a few users expecting that setting default_ttl to 120 was a blanket instruction telling Varnish to keep all objects for 2 minutes. Welp, guess what? That's not quite the case.
Let's have a look at what varnishadm has to say about it:
varnishadm param.show default_ttl
default_ttl
Value is: 120.000 [seconds] (default)
Minimum is: 0.000
The TTL assigned to objects if neither the backend nor the VCL
code assigns one.
NB: This parameter is evaluated only when objects are created.
To change it for all objects, restart or ban everything.
There's a very important sentence here: "The TTL assigned to objects if neither the backend nor the VCL code assigns one." Varnish tries to be a good HTTP citizen, and it will trust the backend, reading its "Expires", "Age" and "Cache-Control" headers (among others) to compute the TTL according to the RFC. default_ttl is only used if we weren't able to get ANY information from the backend. If you wish to set a blanket TTL, you have to do it in VCL:
sub vcl_backend_response {
    set beresp.ttl = 120s;
}
One notable consequence is that changing "max-age" in beresp doesn't impact the TTL, because the TTL is computed only once, before we enter vcl_backend_response, as can be seen in the varnishlog:
- Timestamp Beresp: 1510666680.699230 0.000801 0.000719
- BerespProtocol HTTP/1.0
- BerespStatus 200
- BerespReason OK
- BerespHeader Server: SimpleHTTP/0.6 Python/3.6.3
- BerespHeader Date: Tue, 14 Nov 2017 13:38:00 GMT
- BerespHeader Content-type: text/html; charset=utf-8
- BerespHeader Content-Length: 858
- TTL RFC 120 10 0 1510666681 1510666681 1510666680 0 0
- VCL_call BACKEND_RESPONSE
- BerespHeader X-Url: /
- BerespHeader X-Host: localhost
Varnish first decides on the TTL (default here is 120s because the backend stays mum about what we should do), and only then does it execute the VCL code that has a chance of overriding the TTL.
Note: if the TTL line is a bit obscure, have a look at "man vsl", it lists all the tags and fields you'll find in the varnishlog.
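To make that TTL line a little less cryptic, here's a small Python sketch that splits the record from the log above, using the field layout documented in "man vsl" for Varnish 4/5 (source, TTL, grace, keep, reference time, then RFC-only header fields):

```python
# Split a varnishlog "TTL" record; layout per `man vsl` (Varnish 4/5):
# source ("RFC" = computed from headers, "VCL" = set in VCL code),
# then TTL, grace and keep in seconds, then a reference timestamp.
line = "-   TTL            RFC 120 10 0 1510666681 1510666681 1510666680 0 0"
fields = line.split()[2:]  # drop the "-" prefix and the "TTL" tag
record = {
    "source": fields[0],
    "ttl": int(fields[1]),
    "grace": int(fields[2]),
    "keep": int(fields[3]),
}
print(record)  # {'source': 'RFC', 'ttl': 120, 'grace': 10, 'keep': 0}
```

You can see the defaults at work: 120s of TTL, 10s of grace, 0s of keep.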
default_grace
will kill your live streaming
Vocabulary, again: grace is the time during which, after the TTL, Varnish is allowed to deliver an object. If that happens, a new object fetch will be triggered asynchronously to freshen the cache. That way we can always deliver something immediately without waiting for the backend, at the cost of delivering something slightly stale.
default_grace is the same as default_ttl, but for grace (duh!): if the backend gives us no indication, we set the grace to that value, and the default default_grace is 10s. Easy enough.
The issue for live streaming is that "slightly stale" isn't good enough. Live streaming relies on a frequently updated manifest file with the same URL but with its content being renewed every few seconds to point to new chunks. By delivering a manifest that can be up to 10 seconds late, you hide the newer chunks from the player, which may react badly.
If you are doing live streaming, the best thing to do is to add "-p default_grace=0" to your command line so that the backend can still override it if needed, but you can also set beresp.grace in vcl_backend_response.
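For instance, assuming HLS-style streaming where manifests end in ".m3u8" (the pattern is an assumption, adapt it to your setup), a sketch of disabling grace only for manifests could look like this:

```vcl
sub vcl_backend_response {
    # Hypothetical pattern: HLS manifests get no grace,
    # while the media chunks keep the default behavior.
    if (bereq.url ~ "\.m3u8$") {
        set beresp.grace = 0s;
    }
}
```

That way the chunks, which never change once published, can still be served stale without risk.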
Builtin
This one was present in Espen's entry, and with good reason. Failure to understand the built-in can destroy your performance, most notably because of cookies, but not only.
When you load a VCL, Varnish appends its own built-in code to it, which makes sure that even an empty VCL can still work. The idea is that if your subroutine doesn't return a decision (pass or lookup, for example), execution falls through to the built-in version of that subroutine, which will.
You can output the built-in VCL using
varnishd -x builtin
But you can also find it here: https://github.com/varnishcache/varnish-cache/blob/4.1/bin/varnishd/builtin.vcl . And your distribution probably also packages a commented version somewhere in /usr/share/varnish/.
Let's have a look at the most frequent and traumatic annoyances. In vcl_recv for example, you have these lines:
    if (req.http.Authorization || req.http.Cookie) {
        /* Not cacheable by default */
        return (pass);
    }
If a request has an Authorization header, or a Cookie one, the request is certainly customized, so Varnish backs off by default, to avoid caching private information. If your CMS is using trackers, such as Google Analytics, for example, all requests will probably arrive with a cookie, meaning they won't be cached, which can be a danger to your backend.
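One common workaround, sketched here under the assumption that your static assets never depend on cookies (the extension list is hypothetical), is to drop the Cookie header before the built-in sees it:

```vcl
sub vcl_recv {
    # Hypothetical: static files don't need cookies, so remove the
    # header and let these requests be cached; everything else keeps
    # its cookies and will hit the built-in's return (pass).
    if (req.url ~ "\.(css|js|png|jpg|gif|ico|woff2?)$") {
        unset req.http.Cookie;
    }
}
```

Since this subroutine doesn't return, execution still falls through to the built-in afterwards, but now the cookie check no longer triggers for those requests.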
Another example is vcl_backend_response:
sub vcl_backend_response {
    if (bereq.uncacheable) {
        return (deliver);
    } else if (beresp.ttl <= 0s ||
      beresp.http.Set-Cookie ||
      beresp.http.Surrogate-control ~ "no-store" ||
      (!beresp.http.Surrogate-Control &&
        beresp.http.Cache-Control ~ "no-cache|no-store|private") ||
      beresp.http.Vary == "*") {
        # Mark as "Hit-For-Miss" for the next 2 minutes
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }
    return (deliver);
}
If the backend returns with a Set-Cookie header, a negative TTL, or a no-cache instruction, Varnish will mark the object as uncacheable for 120 seconds (to disable request coalescing).
There are various ways to fix this, from correctly configuring the backend to cleaning the request or response before reaching the built-in to bypassing the built-in altogether. The built-in is here to make sure things don't fail spectacularly, security-wise; the trade-off as you can see is performance, but it's only a default configuration, and you can override it to fit your setup perfectly.
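As a sketch of the "cleaning the response" option, assuming your backend wrongly sets cookies on static assets (again, the extension list is a hypothetical example), you could strip Set-Cookie before the built-in turns the object into hit-for-miss:

```vcl
sub vcl_backend_response {
    # Hypothetical: the backend adds a session cookie even on images;
    # removing it lets the built-in cache the object normally.
    if (bereq.url ~ "\.(png|jpg|gif|css|js)$") {
        unset beresp.http.Set-Cookie;
    }
}
```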
Note: There's an interesting discussion about the built-in VCL and cookies on the varnish-misc mailing-list archive, I warmly encourage everyone to read it: https://varnish-cache.org/lists/pipermail/varnish-misc/2017-July/026023.html.
Varnishlog
varnishlog is an amazing tool. Let me insist on that point: varnishlog is probably the greatest and most useful tool in the Varnish arsenal. Yeah, that much!
It delivers an unparalleled (to my knowledge) amount of information about HTTP transactions; you get:
- All the headers of requests and responses, both on the client and backend sides
- Timing information
- Error messages
- TTL, Grace, Keep information
- Backend info (which one was used, probe states)
- IPs and ports used for each transaction
And frankly, I can't live without it. It's quite normal that people want to store these logs on disk. But this amount of information comes at a price: let's compare the same request viewed via varnishncsa and via varnishlog:
# varnishncsa
::1 - - [14/Nov/2017:07:38:00 -0600] "GET http://localhost/ HTTP/1.1" 200 858 "-" "curl/7.56.1"
# varnishlog
* << Request >> 6
- Begin req 5 restart
- Timestamp Start: 1510666680.699664 0.000045 0.000000
- ReqStart ::1 52104
- ReqMethod GET
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- ReqHeader User-Agent: curl/7.56.1
- ReqHeader Accept: */*
- ReqHeader X-Forwarded-For: ::1
- VCL_call RECV
- VCL_return hash
- VCL_call HASH
- VCL_return lookup
- Hit 3
- VCL_call HIT
- VCL_return deliver
- RespProtocol HTTP/1.0
- RespStatus 200
- RespReason OK
- RespHeader Server: SimpleHTTP/0.6 Python/3.6.3
- RespHeader Date: Tue, 14 Nov 2017 13:38:00 GMT
- RespHeader Content-type: text/html; charset=utf-8
- RespHeader Content-Length: 858
- RespHeader X-Url: /
- RespHeader X-Host: localhost
- RespProtocol HTTP/1.1
- RespHeader X-Varnish: 6 3
- RespHeader Age: 0
- RespHeader Via: 1.1 varnish (Varnish/5.1)
- VCL_call DELIVER
- VCL_Log vcs-key:localhost
- VCL_Log vcs-key:localhost/
- VCL_Log vcs-key:TOTAL
- RespHeader X-Cache: HIT
- VCL_return deliver
- Timestamp Process: 1510666680.699681 0.000062 0.000017
- RespHeader Accept-Ranges: bytes
- RespHeader Connection: keep-alive
- Timestamp Resp: 1510666680.699699 0.000080 0.000018
- ReqAcct 73 0 73 299 858 1157
- End
As you can expect, the latter is going to consume a vast amount of disk space, which has brought many servers to their knees because the admins were not ready for it and the disks got full super fast.
To alleviate that, you can of course use logrotate to clean older logs, or use logstash to push that outside of the machine, but maybe you just don't need to log EVERYTHING, do you? The varnishlog output is awesome for debugging, but on a day-to-day basis, you probably only need the synthetic report or varnishncsa (did you know you can add fields to its format line? Read "man varnishncsa" for more!).
One popular trick is to let varnishncsa collect all the transactions and use that as primary log, and to set up varnishlog to only log problematic transactions. Here's an example:
varnishlog -q "Timestamp:Resp[3] > 2.0 or RespStatus != 200" -g request -w log.bin
This one filters for requests that took more than two seconds to be delivered, or whose status code wasn't 200. I'm also grouping by request so that I can get the associated backend request. And, VERY important, I'm not just outputting text, I'm writing to a binary file (log.bin) so that I can filter it again later:
varnishlog -r log.bin -q "RespStatus == 503" -g request
Note: upon receiving a SIGHUP, varnishlog and varnishncsa will reopen the file, creating a new one if needed, making sure the binary file is still usable. That means they are totally compatible with logrotate's default behavior. Ain't life good?
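A minimal logrotate sketch relying on that SIGHUP behavior could look like this (the log path and the pidfile, which assumes varnishlog was daemonized with -D -P, are hypothetical; adapt them to your setup):

```
/var/log/varnish/log.bin {
    daily
    rotate 7
    postrotate
        # Send HUP so varnishlog reopens a fresh, valid binary file
        kill -HUP "$(cat /var/run/varnishlog.pid)"
    endscript
}
```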
The file storage has no persistence
The primary cache storage in Varnish is the "malloc" one, a memory storage that's extremely fast. It has one tiny little flaw though: if your server reboots, or Varnish is restarted (not reloaded), you lose your cache. For this reason, users turned to the "file" storage, thinking that since the data is stored on disk, they would be able to reuse it. But the file storage offers no persistence either: the on-disk data is unusable without the in-memory index, so after a restart Varnish starts with an empty cache anyway.
That can be a serious issue because if a server with a clean cache gets thrown in the deep end of the pool directly (i.e., full traffic, no ramp-up period), its hit ratio is going to be terrible, resulting in high backend traffic, potentially crushing the backends.
Obviously, the remedy is to have persistence, and this is solved with the Massive Storage Engine. MSE also stores on disk, avoids performance issues linked to fragmentation and features optional persistence through the use of a journal, much like file systems do, to keep track of what's actually on disk and what's reusable so that a dirty power-off won't corrupt your cache.
The more you know
These mistakes can do a lot of damage, but at the same time, they are also very easy to avoid once you know about them, which is why I hope this helped at least some of you dodge a bullet or two. All this can be found in the documentation, of course, but remember that if you need some help, we are always lurking on IRC and Twitter, as well as listening on the varnish-misc mailing-list, so I invite you to join us there so we can assist you!
Also you can watch our on-demand webinar, 10 Varnish Cache mistakes and how to avoid them, to get more insight into avoiding fatal Varnish mistakes.