Varnish Custom Statistics and the world

Written by Guillaume Quintard | 8/30/16 12:00 PM

I quite like Varnish Custom Statistics. The idea behind it is super simple (aggregate data about classes of requests), and yet its use cases are extremely diverse: people use it to monitor the most requested URLs, to watch for brewing backend issues, to do A/B testing, or to create image walls to show the most read articles (here's the article about how it's done).

And today I'd like to show you yet another use case: visualizing on a world map where your users are, hence the click-bait-y title.

Foreword

But before we start, I have to confess something to you: I don't like web development - at all. I find the html/css/js triplet to be a pain to work with that's only made worse by the various different browser implementations. As a C developer, JavaScript is alien and weird to me; CSS is convoluted and html is, after all, xml, and as such, should probably die in a fire.

However, I have to admit that I see the appeal: you can easily prototype, handle both logic and presentation, rely on billions of online examples and modules/plugins/libraries and of course iterate crazy fast. So yeah, I don't like the technology, but it really lowers the entry barrier for developers, and that's a good thing.

And so, this project has been coded in JavaScript, and it was actually pretty painless, even for a hater like me. I guess I'm becoming more mature as I grow old (up?), or maybe I just had big misconceptions about webdev, who knows?

The plan

The idea is to create a web page showing us a map of the world, painting countries according to the number of requests that came from them (the more requests, the darker the shade). Something like this:

For this, we are going to use:

Varnish: You are on Varnish Software's blog, after all. Joking aside, Varnish being the entry point of your platform, it will see all requests, and so will have all the information needed.
Varnish Custom Statistics: VCS will collect all sorts of data about certain requests and categorize them using tags and it can do so for clusters of Varnish, not just individual instances.
vmod-geoip: Using libgeoip (possibly with a free database) this VMOD can translate IP addresses to country names or ISO "ALPHA-2" codes.
jqvmap: This is a pretty cool JavaScript map framework, and since I'm a total n00b in JavaScript, this puppy is going to do the heavy lifting for me (mandatory, related image)

And that's about it, let's set things up!

The VCS side

Hold your breath, don't blink, this is going to be super quick.

VCS actually consists of two software components: the server and the probe. The VCS probe is run on each Varnish server, reads the shmlog and pushes data to the VCS server, which most of the time is on a separate machine.

The server is started with:

vstatd

Yes, the binary is still called "vstatd", the old name of the product, but that's not important. We'll use the default ports, time window sizes and numbers.

And we have to start the probe(s), telling it where the VCS server is (let's say 192.168.0.200):

vstatdprobe 192.168.0.200

And that's it, you can stop holding your breath now. What requests are collected and how is completely driven by VCL, explaining the lack of configuration here.

The Varnish side

It won't actually be much more complicated. First, you need to:

install libgeoip, and maybe a database, such as this one (the Arch Linux package bundles it, so I had no extra work to do).
download, compile and install vmod-geoip; this is now straightforward with Varnish 4.x if the dev packages are installed.

Then, we just have to add a few lines to our VCL:

import std;
import geoip;

sub vcl_recv {
    std.log("vcs-key: FROM-" + geoip.country_code(client.ip));
}

Done! For each request, we are going to log the client's country code, prefixed with "vcs-key: FROM-", where "vcs-key:" is a marker announcing to VCS that the string should be used to tag the request, and "FROM-" is just a string for us to help with filtering.

To check that it works, let's run:

varnishlog -i VCL_Log -g raw

"-g raw" removes all grouping, and "-i VCL_Log" filters only VCL_Log lines, in other words, messages coming from std.log(). The result should look a bit like:

1977337 VCL_Log c vcs-key: FROM-US
2002175 VCL_Log c vcs-key: FROM-CN
1977340 VCL_Log c vcs-key: FROM-CN
2002178 VCL_Log c vcs-key: FROM-US
1977343 VCL_Log c vcs-key: FROM-KR
2002181 VCL_Log c vcs-key: FROM-US
1977346 VCL_Log c vcs-key: FROM-CA
2002184 VCL_Log c vcs-key: FROM-US
1977349 VCL_Log c vcs-key: FROM-CA
2002187 VCL_Log c vcs-key: FROM-CN
1977352 VCL_Log c vcs-key: FROM-US
2002190 VCL_Log c vcs-key: FROM-Unknown
1977355 VCL_Log c vcs-key: FROM-Unknown
2002193 VCL_Log c vcs-key: FROM-IR
1977358 VCL_Log c vcs-key: FROM-FR

Which is not too surprising; that's what we asked for. There are a few unknown IPs, but after all, we are using a free, lower quality database, so that's normal.
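If you want a quick sanity check before involving VCS at all, the log lines above can be tallied per country. Here is a minimal sketch in plain JavaScript (it runs in Node.js, no browser needed). The helper names countryFromLogLine and tallyCountries are invented for this example, and the code assumes each line has exactly the shape shown in the varnishlog output above.

```javascript
// Hypothetical helpers, not part of Varnish or VCS.

// Extract the country code from a line such as
// "1977337 VCL_Log c vcs-key: FROM-US"; returns null if the line doesn't match.
function countryFromLogLine(line) {
    var m = /VCL_Log\s+c\s+vcs-key:\s+FROM-(\S+)\s*$/.exec(line);
    return m ? m[1] : null;
}

// Count how many times each country code appears in a list of log lines.
function tallyCountries(lines) {
    var counts = {};
    lines.forEach(function (line) {
        var code = countryFromLogLine(line);
        if (code !== null) {
            counts[code] = (counts[code] || 0) + 1;
        }
    });
    return counts;
}

// A few lines copied from the output above.
var sampleLines = [
    "1977337 VCL_Log c vcs-key: FROM-US",
    "2002175 VCL_Log c vcs-key: FROM-CN",
    "1977340 VCL_Log c vcs-key: FROM-CN",
    "2002190 VCL_Log c vcs-key: FROM-Unknown"
];
console.log(tallyCountries(sampleLines));
```

This is only a debugging aid: VCS does the real counting, per time window and across all your Varnish servers.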

The VCS API

Data is starting to pour into our VCS server; let's see what's available.

With the endpoint /all/, we can retrieve all the vcs-keys seen and currently in memory:

curl $VCSIP:$VCSPORT/all/

{
    "keys": [
    "FROM-Unknown",
    "FROM-A2",
    "FROM-AD",
    "FROM-AE"
    ]
}

To get info about one key (i.e., all the requests flagged using this tag), /key/STRING is used:

curl $VCSIP:$VCSPORT/key/FROM-RU

{
    "FROM-RU": [
        {
            "timestamp": "2016-08-02T18:06:30",
            "n_req": 42,
            "n_req_uniq": "NaN",
            "n_miss": 42,
            "avg_restarts": 0.000000,
            "n_bodybytes": 11928,
            "reqbytes": 3874,
            "respbytes": 22050,
            "berespbytes": 0,
            "bereqbytes": 0,
            "ttfb_miss": 0.000166,
            "ttfb_hit": "NaN",
            "resp_1xx": 0,
            "resp_2xx": 0,
            "resp_3xx": 0,
            "resp_4xx": 0,
            "resp_5xx": 42
        },
        {
            "timestamp": "2016-08-02T18:06:00",
            "n_req": 43,
            "n_req_uniq": "NaN",
            "n_miss": 43,
            "avg_restarts": 0.000000,
            "n_bodybytes": 12212,
            "reqbytes": 3968,
            "respbytes": 22575,
            "berespbytes": 0,
            "bereqbytes": 0,
            "ttfb_miss": 0.000180,
            "ttfb_hit": "NaN",
            "resp_1xx": 0,
            "resp_2xx": 0,
            "resp_3xx": 0,
            "resp_4xx": 0,
            "resp_5xx": 43
        },
        {
            "timestamp": "2016-08-02T18:05:30",
            "n_req": 44,
            "n_req_uniq": "NaN",
            "n_miss": 44,
            "avg_restarts": 0.000000,
            "n_bodybytes": 12496,
            "reqbytes": 4042,
            "respbytes": 23100,
            "berespbytes": 0,
            "bereqbytes": 0,
            ...

As you can see, data is aggregated in 30-second windows by default (look at the timestamps), giving you almost real-time feedback on how your data is consumed. Here we can tell that we get around 40 requests from Russia every 30 seconds, generating about 22 kB of traffic to the clients. We can also tell that I should fix my backend, since all the requests received 5XX responses (truth is, I got lazy and didn't start the backend).
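The same reading of the numbers can be done in code. The sketch below boils a list of windows down to a request rate and an error ratio. The helper name summarizeKey is made up for this example; it assumes the field names shown in the output above and takes the window length as a parameter (30 seconds being the default mentioned here).

```javascript
// Hypothetical helper: summarize the windows returned by /key/STRING.
// Only n_req, resp_5xx and respbytes are used; other fields are ignored.
function summarizeKey(windows, windowSeconds) {
    var totalReq = 0, total5xx = 0, totalRespBytes = 0;
    windows.forEach(function (w) {
        totalReq += w.n_req;
        total5xx += w.resp_5xx;
        totalRespBytes += w.respbytes;
    });
    var seconds = windows.length * windowSeconds;
    return {
        // average number of requests per second over all windows
        reqPerSecond: seconds > 0 ? totalReq / seconds : 0,
        // share of requests answered with a 5XX status, between 0 and 1
        errorRatio: totalReq > 0 ? total5xx / totalReq : 0,
        // average number of response bytes per window
        respBytesPerWindow: windows.length > 0 ? totalRespBytes / windows.length : 0
    };
}

// The first two windows of the FROM-RU output above, trimmed to the
// fields the helper needs.
var fromRu = [
    { timestamp: "2016-08-02T18:06:30", n_req: 42, respbytes: 22050, resp_5xx: 42 },
    { timestamp: "2016-08-02T18:06:00", n_req: 43, respbytes: 22575, resp_5xx: 43 }
];
console.log(summarizeKey(fromRu, 30));
```

Note that some fields (n_req_uniq and ttfb_hit here) come back as the string "NaN", so anything that averages those would have to filter them out first.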

Let's finish on a more complex request, which is actually the one we are going to use:

curl $VCSIP:$VCSPORT/match/FROM-/top/300?b=10

This asks VCS:

to return only the keys matching "FROM-".
to return only the 300 most requested keys. We are good anyway since there are fewer countries than that, but it will force VCS to count and show the number of requests in the results, instead of just displaying the keys.
to use the last ten time windows (that's the b=10 part) to compute the most requested keys, instead of only using the last one.

The result should look like this:

{
    "FROM-US": 14120,
    "FROM-Unknown": 5778,
    "FROM-CN": 2962,
    "FROM-JP": 1817,
    "FROM-GB": 1100,
    "FROM-DE": 1060,
    "FROM-KR": 1023,
    "FROM-BR": 756,
    "FROM-FR": 741,
    "FROM-CA": 698,
    "FROM-IT": 474,
    "FROM-NL": 444,
    "FROM-AU": 437,
    "FROM-RU": 421,
    "FROM-IN": 361,
    "FROM-TW": 299,
    ...

And this is what we are going to use in our JavaScript, which we are now ready to write.
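As a small warm-up, here is a sketch of how that query URL could be assembled, and how the result could be turned into a sorted list. Both helper names (topKeysUrl and sortByRequests) are invented for this example, and the base address is just a placeholder for wherever your VCS server listens.

```javascript
// Hypothetical helper: build the /match/.../top/... URL used above.
// `base` is the VCS server address, `prefix` the string keys must match,
// `count` the number of keys to return, and `buckets` the value passed
// as the b parameter.
function topKeysUrl(base, prefix, count, buckets) {
    return base + "/match/" + encodeURIComponent(prefix) +
        "/top/" + count + "?b=" + buckets;
}

// Hypothetical helper: turn {"FROM-US": 14120, ...} into an array of
// [key, count] pairs, biggest count first.
function sortByRequests(data) {
    return Object.keys(data).map(function (key) {
        return [key, data[key]];
    }).sort(function (a, b) {
        return b[1] - a[1];
    });
}

console.log(topKeysUrl("http://127.0.0.1:8888", "FROM-", 300, 10));
console.log(sortByRequests({ "FROM-FR": 741, "FROM-US": 14120, "FROM-CN": 2962 }));
```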

Enter Mordor

Before we start, let me state that again: this is not my turf, and I did what most new coders do: I stole code, specifically from the jqvmap README, but in my defense, the example given was doing pretty much what I needed.

Some requirements

As said at the beginning, we are going to use jqvmap, meaning we need to include four elements in our HTML page:

jqvmap's css, so our map is all nice and fancy
jquery, because nothing is pure js anymore and jqvmap heavily uses this framework
jqvmap's code, that's to be expected
a world map. jqvmap can plot any map, and has quite a collection, but right now, we are interested in a world map.

HTML code is:

<script type="text/javascript" src="http://code.jquery.com/jquery-1.11.3.min.js"></script>
<script type="text/javascript" src="https://rawgit.com/manifestinteractive/jqvmap/master/dist/jquery.vmap.js"></script>
<script type="text/javascript" src="https://rawgit.com/manifestinteractive/jqvmap/master/dist/maps/jquery.vmap.world.js" charset="utf-8"></script>

Note that for the last two, I just used rawgit so I could avoid hosting the code while still executing it.

And I'll also create an empty div for jqvmap to populate:

<div id="vmap" style="width: 100%; height: 90%;"></div>

Show us the code!

Ok, everything is in place. Now we just have to create the map and update it every 20 seconds. Here's the map creation, which happens once the page is loaded:

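// latest per-country request counts, filled in by parseAndShow()
var g_req = {};

// fetch the data from VCS (JSONP) and hand it to parseAndShow()
function mapUpdate() {
    $.getJSON("http://127.0.0.1:8888/match/FROM-/top/300?b=20&callback=?", parseAndShow);
}

// set the label text when hovering over a country
function labelShow(event, label, code) {
    label.text(g_req[code] +
               " requests originated from " +
               JQVMap.maps['world_en'].paths[code].name);
}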

// disable the default click behavior
function regionClick(event, code, region) {
    event.preventDefault();
}

jQuery(document).ready(function() {
    jQuery('#vmap').vectorMap({
        map: 'world_en',
        hoverColor: '#005aff',
        scaleColors: ['#d8f8ff', '#005ace'],
        onRegionClick: regionClick,
        onLabelShow: labelShow, 
        normalizeFunction: 'polynomial'
    });
    mapUpdate();
});

Some explanation about the vectorMap() arguments:

map: what map should be used, we only loaded one here, so there's not much suspense.
hoverColor: the default color for a highlighted zone is green, but the Varnish color is blue, so I needed to adapt it.
scaleColors and normalizeFunction: we are going to give per-country values to jqvmap, and these two parameters direct how they will be translated into colors, scaleColors being the lower/upper bounds, and normalizeFunction how colors are going to be spread in the interval.
onRegionClick: give a callback to run when a country is clicked, and that callback (regionClick) actually prevents the default behavior.
onLabelShow: there's a label under the map, and we can use a callback to put whatever text we want in it.
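To get a feel for what scaleColors and normalizeFunction do together, here is a rough stand-alone sketch that maps a request count to a shade between two colors. It is an illustration, not jqvmap's actual code: all the helper names are invented, and the 0.2 exponent used for the "polynomial" curve is an assumption made for this example.

```javascript
// Illustrative only: NOT jqvmap's implementation.

// Parse "#rrggbb" into [r, g, b].
function hexToRgb(hex) {
    return [1, 3, 5].map(function (i) {
        return parseInt(hex.slice(i, i + 2), 16);
    });
}

// Turn [r, g, b] back into "#rrggbb".
function rgbToHex(rgb) {
    return "#" + rgb.map(function (c) {
        var s = Math.round(c).toString(16);
        return s.length === 1 ? "0" + s : s;
    }).join("");
}

// Assumed normalization curve: it compresses large values so that a single
// very busy country doesn't wash out all the others.
function polynomial(v) {
    return Math.pow(v, 0.2);
}

// Map `value` in [min, max] to a color between scaleColors[0] (lower bound)
// and scaleColors[1] (upper bound), after applying `normalize`.
function shade(value, min, max, scaleColors, normalize) {
    var lo = normalize(min), hi = normalize(max);
    var t = hi > lo ? (normalize(value) - lo) / (hi - lo) : 0;
    var a = hexToRgb(scaleColors[0]), b = hexToRgb(scaleColors[1]);
    return rgbToHex(a.map(function (c, i) {
        return c + (b[i] - c) * t;
    }));
}

var colors = ['#d8f8ff', '#005ace'];
console.log(shade(0, 0, 14120, colors, polynomial));     // lower bound color
console.log(shade(421, 0, 14120, colors, polynomial));   // somewhere in between
console.log(shade(14120, 0, 14120, colors, polynomial)); // upper bound color
```

With a curve like this, countries with modest traffic still get a visibly different shade from countries with none, which is what you want when one country dwarfs all the others.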

But vectorMap() only creates a blank map, so we need to color it. Thankfully, jqvmap will do most of the job for us, and we only have to give it a dictionary looking like:

{ "us": 34, "ru": 54, "fr": 23, ...}

i.e., the lowercase country codes are the keys and the numbers of requests are the values. Coloring then happens automagically using scaleColors and normalizeFunction.

This happens in two steps:

grab the data; this is done in mapUpdate
adapt the data from VCS to fit jqvmap, that means converting keys from "FROM-XX" to "xx", and making sure all countries are represented:

function parseAndShow(data) {
    var countries = [
        'af', 'ax', 'al', 'dz', 'as', 'ad', 'ao', 'ai', 'aq', 'ag', 'ar', 'am',
        'aw', 'au', 'at', 'az', 'bs', 'bh', 'bd', 'bb', 'by', 'be', 'bz', 'bj',
        'bm', 'bt', 'bo', 'ba', 'bw', 'bv', 'br', 'io', 'bn', 'bg', 'bf', 'bi',
        'kh', 'cm', 'ca', 'cv', 'ky', 'cf', 'td', 'cl', 'cn', 'cx', 'cc', 'co',
        'km', 'cg', 'cd', 'ck', 'cr', 'ci', 'hr', 'cu', 'cy', 'cz', 'dk', 'dj',
        'dm', 'do', 'ec', 'eg', 'sv', 'gq', 'er', 'ee', 'et', 'fk', 'fo', 'fj',
        'fi', 'fr', 'gf', 'pf', 'tf', 'ga', 'gm', 'ge', 'de', 'gh', 'gi', 'gr',
        'gl', 'gd', 'gp', 'gu', 'gt', 'gg', 'gn', 'gw', 'gy', 'ht', 'hm', 'va',
        'hn', 'hk', 'hu', 'is', 'in', 'id', 'ir', 'iq', 'ie', 'im', 'il', 'it',
        'jm', 'jp', 'je', 'jo', 'kz', 'ke', 'ki', 'kr', 'kw', 'kg', 'la', 'lv',
        'lb', 'ls', 'lr', 'ly', 'li', 'lt', 'lu', 'mo', 'mk', 'mg', 'mw', 'my',
        'mv', 'ml', 'mt', 'mh', 'mq', 'mr', 'mu', 'yt', 'mx', 'fm', 'md', 'mc',
        'mn', 'me', 'ms', 'ma', 'mz', 'mm', 'na', 'nr', 'np', 'nl', 'an', 'nc',
        'nz', 'ni', 'ne', 'ng', 'nu', 'nf', 'mp', 'no', 'om', 'pk', 'pw', 'ps',
        'pa', 'pg', 'py', 'pe', 'ph', 'pn', 'pl', 'pt', 'pr', 'qa', 're', 'ro',
        'ru', 'rw', 'bl', 'sh', 'kn', 'lc', 'mf', 'pm', 'vc', 'ws', 'sm', 'st',
        'sa', 'sn', 'rs', 'sc', 'sl', 'sg', 'sk', 'si', 'sb', 'so', 'za', 'gs',
        'es', 'lk', 'sd', 'sr', 'sj', 'sz', 'se', 'ch', 'sy', 'tw', 'tj', 'tz',
        'th', 'tl', 'tg', 'tk', 'to', 'tt', 'tn', 'tr', 'tm', 'tc', 'tv', 'ug',
        'ua', 'ae', 'gb', 'us', 'um', 'uy', 'uz', 'vu', 've', 'vn', 'vg', 'vi',
        'wf', 'eh', 'ye', 'zm', 'zw', 'kp' 
    ];

    var reqs = {};
    $.each(countries, function(idx, key) {
        var ckey = "FROM-" + key.toUpperCase();
        if (data[ckey]) {
            reqs[key] = data[ckey];
        } else {
            reqs[key] = 0;
        }
    });

    g_req = reqs;
    jQuery('#vmap').vectorMap('set', 'values',  reqs);
    setTimeout(mapUpdate, 20000);
}

At the end, I set a timer to rerun mapUpdate 20 seconds later, and I updated g_req so that labelShow can use it when I hover over a country.
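Since parseAndShow mixes the data conversion with the map update and the timer, it can be handy to pull the conversion out into a pure function that you can check outside a browser. Here is a minimal sketch of that idea; the name toMapValues is invented for this example, and the country list is shortened to three entries to keep it readable.

```javascript
// Hypothetical, jQuery-free variant of the conversion step in parseAndShow().
// `data` is the object returned by VCS ({"FROM-US": 14120, ...}) and
// `countries` the list of lowercase country codes known to the map.
function toMapValues(data, countries) {
    var values = {};
    countries.forEach(function (code) {
        var key = "FROM-" + code.toUpperCase();
        // Countries missing from the VCS answer get 0, so that every
        // region of the map has a value.
        values[code] = data[key] || 0;
    });
    return values;
}

var sampleData = { "FROM-US": 14120, "FROM-Unknown": 5778, "FROM-FR": 741 };
console.log(toMapValues(sampleData, ["us", "fr", "kp"]));
```

Keys that don't correspond to a country on the map, like "FROM-Unknown", are simply never looked up, so they don't end up in the result.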

And we are done! The maps in the page are static to avoid running VCS ad vitam aeternam just for a blog post, but if you wish to see the full "actual" code, it's here.

From here to there

Of course, you can make the maps even sexier by adding tooltips, fancier colors and cool effects, but as you can guess, I leave this as an exercise to you, the reader.

BUT, there's one last thing I wanted to show you before we part ways, a quick change for a deep addition. Let's say we add one line to our VCL:

import std;
import geoip;

sub vcl_recv {
    std.log("vcs-key: FROM-" + geoip.country_code(client.ip));
    std.log("vcs-key: TO-"   + geoip.country_code(server.ip));
}

We are now recording not only the client's country but also the server's. True, you need more than one point of presence for this to be interesting, but the point is that, with very few changes to your JavaScript, you can go from the original map, showing the clients' activity, to this one, showing the servers' activity:

Where you can see at a glance that your Japanese servers are getting more than their share of requests.
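To give an idea of how small those JavaScript changes can be, here is a sketch that splits a set of VCS keys into one jqvmap-style dictionary per prefix. The helper name groupByPrefix is invented for this example, and it assumes you already have an object containing both "FROM-" and "TO-" keys; whether you get there with one query or by merging two is left open.

```javascript
// Hypothetical helper: split keys such as "FROM-US" or "TO-JP" into one
// dictionary per prefix, with lowercase country codes as keys.
// Keys without a two-character code (e.g. "FROM-Unknown") are skipped.
function groupByPrefix(data) {
    var groups = {};
    Object.keys(data).forEach(function (key) {
        var m = /^(FROM|TO)-([A-Z0-9]{2})$/.exec(key);
        if (!m) {
            return;
        }
        var prefix = m[1];
        var code = m[2].toLowerCase();
        groups[prefix] = groups[prefix] || {};
        groups[prefix][code] = data[key];
    });
    return groups;
}

var mixed = { "FROM-US": 10, "TO-JP": 7, "FROM-Unknown": 3, "TO-US": 2 };
console.log(groupByPrefix(mixed));
```

Each of the resulting dictionaries can then be completed with zeros for the missing countries and handed to its own map.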

Conclusion

VCS is a generic tool, offering great versatility and super easy integration, notably with JavaScript, which bundles HTTP+JSON directly into the language, as we have seen here. But this is only a very specific example, made to kickstart your creativity and make you think about how it can be useful for YOU and your Varnish usage.

Data analysis is already a crucial part of running a website, and is not limited to just bandwidth and requests per second. Combined with Varnish, VCS can be the tool to give you the necessary insight on who your public is and how your content is consumed to create a better, more efficient service.

Ready to learn more about VCS? Join us for our live webinar, How to identify issues in Varnish and track web-traffic stats in real-time: Getting the most out of Varnish Custom Statistics on September 8th.
