August 26, 2020
6 min read time

Bot identity verification in Varnish


Are you sure the robots visiting your website are who they say they are? The newest addition to the VMOD roster in Varnish Enterprise, VMOD Resolver, brings bot identity verification to a VCL near you!

 


 

Trust, but verify

If you want to appear in search results, you have to let search engines index your content. You may already be letting web crawlers from some search engines through the paywall on your website, giving them access to your premium content. But are the crawlers you let through the gates legitimate, or are they big, wooden, and horse-shaped?

Identifying and verifying web crawlers from a search engine can be as simple as adding a list of IPs to an Access Control List (ACL) in your VCL configuration. For example, the search engine DuckDuckGo publishes the statically assigned IPs of its "DuckDuckBot" web crawlers. Other search engines, such as Google and Bing, assign IPs to their web crawlers dynamically. An ACL-based approach would therefore be unreliable for both GoogleBots and BingBots, so another method of identification and verification is needed.
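As a minimal sketch of the ACL-based approach, the fragment below marks requests whose client IP is on a crawler list. The addresses are placeholders from the reserved documentation ranges, not DuckDuckGo's real IPs, and the acl-bot header name is a hypothetical choice for illustration:

```vcl
# Placeholder addresses from the reserved documentation ranges.
# Replace them with the crawler IPs published by the search engine.
acl duckduckbot {
    "192.0.2.10";
    "198.51.100.0"/24;
}

sub vcl_recv {
    # Remove any value the client may have sent itself.
    unset req.http.acl-bot;

    # Hypothetical header: set only when the client IP is in the ACL.
    if (client.ip ~ duckduckbot) {
        set req.http.acl-bot = "duckduckbot";
    }
}
```

The list is only as good as its maintenance: if the published IPs change, the ACL has to be updated to match.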

It has long been possible to identify web crawlers like GoogleBot in VCL. By matching the regular expression (?i)googlebot against the client's User-Agent request header, you can create special rules for all clients claiming to be a GoogleBot. While all legitimate GoogleBots will pass this test and get access to your premium content, so will any other client with a matching, easily spoofed User-Agent request header. To be sure that a client is a legitimate web crawler, you have to verify its identity.
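The User-Agent match on its own could look something like the sketch below. The claimed-bot header is a hypothetical name, chosen to make clear that the result is only a claim and not a verified identity:

```vcl
sub vcl_recv {
    # Remove any value the client may have sent itself.
    unset req.http.claimed-bot;

    # (?i) makes the match case-insensitive. Anyone can send this
    # User-Agent, so a match only tells us what the client claims to be.
    if (req.http.User-Agent ~ "(?i)googlebot") {
        set req.http.claimed-bot = "googlebot";
    }
}
```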

Verify a bot with VMOD Resolver

VMOD Resolver brings Forward Confirmed reverse DNS (FCrDNS) to Varnish Enterprise. Often used to verify the domain of email senders, this method of identity verification may also be used to verify web crawlers. In fact, both Google and Bing recommend this method of verification for their crawlers.

The verification procedure is simple: First, a reverse DNS lookup on the client IP gives us the Fully Qualified Domain Name (FQDN) of the client. Then, a forward DNS lookup on the FQDN gives us a list of associated IPs. If any of these IPs match the original client IP, the relationship between IP and domain is verified, and we can check whether the client is coming from the expected domain.

To verify the identity of all clients claiming to be GoogleBot, you can add the following to your VCL file:

import resolver;
import str;

sub vcl_recv {
    unset req.http.verified-bot;
    if (req.http.User-Agent ~ "(?i)googlebot" && resolver.resolve()) {
        if (str.endswith(resolver.domain(), ".googlebot.com") ||
            str.endswith(resolver.domain(), ".google.com")) {
            set req.http.verified-bot = "googlebot";
        }
    }
}
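What you do with the verified-bot header is up to you. As one possible sketch, the fragment below rejects clients that claim to be GoogleBot but failed verification. It assumes it is placed after the vcl_recv above in the same VCL file, so that the header has already been set or left unset. The 403 response is an illustrative policy choice, not a recommendation from the VMOD documentation:

```vcl
sub vcl_recv {
    # The client claims to be GoogleBot, but the FCrDNS check above
    # did not set the verified-bot header: treat it as an impostor.
    if (req.http.User-Agent ~ "(?i)googlebot" && !req.http.verified-bot) {
        return (synth(403, "Forbidden"));
    }
}
```

Because the verification code unsets verified-bot before doing anything else, a client cannot pass this check by sending the header itself.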

 

One caveat to note: For FCrDNS verification to succeed, a circular DNS mapping between IP and FQDN must exist. The previously mentioned DuckDuckBots are a good example of how this is not always the case, even for honest clients. An ACL-based approach is therefore still recommended for DuckDuckBots and any other crawlers that cannot be verified with FCrDNS.

Official documentation for VMOD Resolver

 

Verify lots of bots with veribot.vcl

Are you looking at the above example and wishing you had an easy way to grant access based on a bunch of domains and User-Agents? If so, veribot.vcl is what you are looking for.

Bundled with VMOD Resolver is a VCL file called veribot.vcl. This VCL combines the domain resolution capabilities of VMOD Resolver with the list manipulation capabilities of VMOD Rewrite, and the caching capabilities of VMOD Kvstore. The result is a powerful domain-based access control tool that can be easily integrated with your existing VCL.

Official documentation for veribot.vcl

If you want to understand the internal workings of veribot.vcl, and maybe modify it to fit your particular use case, the following is the complete contents of the VCL:

import rewrite;
import kvstore;
import resolver;

sub vcl_init {
    new vb_domains = kvstore.init();
    new vb_ua_filter = rewrite.ruleset(string = {""}, type = regex, min_fields = 1);
    new vb_domain_rules = rewrite.ruleset(string = {""}, type = suffix, min_fields = 2);
}

sub vb_check_client {
    unset req.http.vb-error;
    unset req.http.vb-access;
    unset req.http.vb-domain;

    // Filter out irrelevant clients based on User-Agent
    if (!vb_ua_filter.match(req.http.User-Agent)) {
        set req.http.vb-error = "VB: Client UA did not pass UA filter";
        return;
    }

    // Check for a previous resolution of the client ip in cache
    set req.http.vb-domain = vb_domains.get(client.ip, "");

    // If not in cache, resolve the client domain name and cache it
    if (req.http.vb-domain == "") {
        if (!resolver.resolve()) {
            unset req.http.vb-domain;
            set req.http.vb-error = "VB: Resolve failed with: " + resolver.error();
            return;
        }
        set req.http.vb-domain = resolver.domain();
        vb_domains.set(client.ip, req.http.vb-domain);
    }

    // Suffix-match the client domain name
    if (!vb_domain_rules.match(req.http.vb-domain)) {
        set req.http.vb-error = "VB: Client domain did not match any rule";
        return;
    }

    set req.http.vb-access = vb_domain_rules.rewrite(2, mode = only_matching);
}
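Judging from the listing above, the subroutine reports its result through three request headers: vb-access, vb-domain, and vb-error. A rough sketch of how your own VCL might consume them follows. It assumes that veribot.vcl is reachable through your VCL include path, that both rulesets have been populated as described in the official documentation (they start out empty in vcl_init), and that your domain rules use the hypothetical value "allow" as their second field:

```vcl
import std;

# Assumption: veribot.vcl can be found on the VCL include path.
include "veribot.vcl";

sub vcl_recv {
    call vb_check_client;

    if (req.http.vb-access == "allow") {
        # "allow" is a hypothetical second field in the domain rules.
        # The client passed the UA filter and matched a domain rule.
        std.log("Verified crawler from " + req.http.vb-domain);
    } else if (req.http.vb-error) {
        # Not an accepted crawler: log why the check stopped.
        std.log(req.http.vb-error);
    }
}
```

Note that vb_check_client unsets all three headers before it does anything else, so values sent by the client never survive the call.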

 

 
