Don't bug me: how to debug and fix Varnish

Varnish Cache is quite easy to install and make it run, but is it easy to debug as well? The answer is “meh”. It can go either way.

Yes - because the Varnish tools you have at your disposal (in particular Varnishlog and Varnishstat) will give you a lot of information regarding your Varnish instance and with some patience you’ll be able to understand what is going on.

No - because if the strange behavior/crash is triggered by a peculiar scenario, then it will be difficult to reproduce, and likewise difficult to understand and fix.

With these conflicting thoughts on debugging in mind, I’d like to share some of the insights from my experience working in Varnish Software support.

There are four different types of possible problems, starting with the easiest to solve:

Functional: for example when a part of your VCL is never reached or a wrong logic has been defined and you don’t understand what Varnish is actually doing and why it is not following your will.
This issue is quite easy to debug and fix. In most cases, simply testing your VCL before pushing it in production will do the trick. I suggest you have a look at the varnishtest tool.
Operational: In cases of low hit-rates or high response times, you are facing a slightly more complicated issue for debugging. The complications arise because it not only might be caused by your VCL, but your backends and/or Varnish itself might be involved as well.
Varnishlog is usually the most suitable tool for such a scenario. It is very verbose and you should have no problem understanding which transaction does what. That is, if you want to understand why your hit-rate is so low (assuming you are not streaming; if you are streaming, then your hit-rate is expected to be not “brilliant”), you can either check whether your Varnish is seeing a lot of “passes” or a lot of “misses”. To do so, the query language available for the Varnishlog tool is the proper way to go.
‘varnishlog -q “VCL_call eq PASS”’ shows only “pass” transactions. If you have a lot of these it will lead to a low hit-rate. Fallback to point one to solve this issue.
Capacity: Is your cache storage too small for your data set? Wrong number of threads?
Varnishstat has more than 300 counters that give you a very good indication of the health of your instance since startup time.

Important counters I always check whenever I’m on support include:

MAIN.cache_hit: number of cache hits
MAIN.cache_miss: number of cache miss
MAIN.n_lru_nuked: number of LRU nuked objects from cache.
If this number is very big, then your cache storage is too small for your data set.
SMA.s0.xxx: SMA.s0 represent counters and gauges related to the actual use of your cache storage. Keeping those under control is suggested, so you’ll be able to check how the memory is used by Varnish and if you have allocated too much or too little storage for it.

And now the most time-consuming type of error to debug:

Stability: Varnish is crashing and you have no idea why.
Get yourself a coffee, relax, take your time and get your hands dirty.
Things to check:

“varnishadm panic.show”: if varnish has crashed this command will output the panic backlog:

"Assert error in CNT_Request(), cache/cache_req_fsm.c line 820:
Condition(req->req_step == R_STP_LOOKUP || req->req_step == R_STP_RECV) not true.
errno = 2 (No such file or directory)
thread = (cache-worker)
version = varnish-4.1.3 revision 5e3b6d2
ident = Linux,3.14.32-xxxx-grs-ipv6-64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
0x434025: varnishd() [0x434025]
0x438779: varnishd(CNT_Request+0x1bd9) [0x438779]
0x44fc6f: varnishd(HTTP1_Session+0xef) [0x44fc6f]
0x43a5ad: varnishd(SES_Proto_Req+0x5d) [0x43a5ad]
0x44a522: varnishd() [0x44a522]
0x44a98b: varnishd() [0x44a98b]
0x74c77a6fb0a4: libpthread.so.0(+0x80a4) [0x74c77a6fb0a4]
0x74c77a43087d: libc.so.6(clone+0x6d) [0x74c77a43087d]
req = 0x74c6e4c06020 {
vxid = 0, step = R_STP_RESTART,
req_body = R_BODY_INIT,
restarts = 0, esi_level = 0,
sp = 0x74c715f8b220 {
   fd = 79, vxid = 26927013,
   client = X.X.X.X 27212,
   t_open = 1475250171.316174,
   t_idle = 1475250171.645251,
   step = S_STP_H1PROC,
},”
If the panic is not present, then you’ll have to go through “/var/log/syslog” and look for anything Varnish related. I usually grep for “die and/or kill” within the syslog output to check whether Varnish has indeed crashed or been killed.

Once you have found the panic or any other odd behavior you can go through the open Varnish Cache issues (https://github.com/varnishcache/varnish-cache/issues) and research whether it is a known open/solved issue.

If it is a solved issue, then all you need to do is upgrade to the latest Varnish Cache or Varnish Cache Plus version. If it is not solved or still unknown you can get in touch with Varnish Software support.

If you enjoy these tutorial-style blog posts and want to learn more about Varnish, you might want to sign up for our upcoming live webinar, 4 ways to speed up your web and mobile performance.

Don't bug me: how to debug and fix Varnish

Top 10 things you need to know about Varnish

Zipnish: The whys and hows

Unlock the Secrets of Varnish with Stack Overflow Insights: A Deep Dive with the Experts

Don't bug me: how to debug and fix Varnish

You may also like this

Top 10 things you need to know about Varnish

Zipnish: The whys and hows

Unlock the Secrets of Varnish with Stack Overflow Insights: A Deep Dive with the Experts

SUBSCRIBE TO OUR BLOG

SEARCH OUR BLOG