TinyKVM in Varnish and some Deno

A little bit about compute in Varnish Cache and some Deno JS benchmarks

Hey all. I recently wrote about TinyKVM, a sandbox with native performance. This time I want to write about how you can try it out as a compute framework in Varnish Cache. I’ve also been very anxious (to say the least) about whether or not my theories hold up in practice. Is TinyKVM really the fastest way to sandbox compute workloads? What about per-request isolation? I’ve invited Laurence Rowe to write with me about his adventure in embedding the Rust-based Deno JavaScript runtime in TinyKVM.

I wrote previously that TinyKVM runs regular Linux ELFs and I use this to test my programs on the terminal like a regular program before I load them in Varnish. When running TinyKVM embedded in Varnish, there is a small API to facilitate receiving a HTTP request and writing back a response. Which is really the crux of the whole thing: I specifically designed it to be just a dumb “here’s a request” and “where’s the response?” type of thing:

  
    
    
    
    
      int main(int argc, char **argv)
{
  if (IS_LINUX_MAIN()) {
    puts("Hello Linux Terminal World!");
    return 0;
  }

  while (true) {
    struct backend_request req;
    wait_for_requests_paused(&req);

    http_appendf(RESP, "X-Hello: %s", "World");
    backend_response_str(200, "text/plain", "Hello Compute World!");
  }
}
    

The wait_for_requests_paused() inside the while loop will receive requests as they come. If the program crashes at any point it will be forcibly reset back to the state at the time of pausing. Alternatively, you can configure programs to be reset back to the paused state after every request. So-called ephemeral VMs — as they are unable to persist anything. In effect it becomes per-request isolation.

  
    
    
    
    
      #include "kvm_api.hpp"
#include <simdjson/minify.h>
static std::vector<char> buffer(16ULL << 20); // 16MB buffer

static void
on_post(const char *, const char *,
const char *content_type, const uint8_t* data, const size_t len)
{
   /* Minify the JSON at 6GB/s */
   size_t buffer_len = buffer.size();
   (void)simdjson::minify((const char *)data, len, buffer.data(), buffer_len);

   /* Respond with the minified JSON */
   Backend::response(200, "application/json", buffer.data(), buffer_len);
}

int main(int argc, char **argv)
{
   set_backend_post(on_post);
   wait_for_requests();
}
    

Above: A JSON minification program using simdjson. A static buffer is used in order to keep memory usage for non-ephemeral in check. When the program is ephemeral there is no need to track memory usage, because it will get completely reset after the request completes. In other words, as long as the program is able to deliver a response, it’s fine, as it will start over in a pristine condition on the next request. Programs that rely on GC can be faster with ephemeral compared to non-ephemeral because the potentially time-consuming GC is never allowed to run.

A program example

TinyKVM Blog Image 1

I shamelessly used my own GBC emulator.

One of the TinyKVM example programs embeds a GBC emulator I wrote in 2017, which is presented as a webpage lightly strewn with some inline CSS wizardry by my friend Kyle. I once embedded it in IncludeOS, added PS/2 keyboard support and sent it to the Qemu advent calendar. It got accepted! The ROM it was using was a city builder homebrew game called µCity Advance made by Antonio Niño Díaz, which I had his permission to use!

I made this just for fun, of course. It provides co-operative gameplay where you fight for the controls. Good luck!

Shared mutable storage

I want to explain a little bit about how the co-operative GBC gameplay makes progress despite being served by separate request VMs with no knowledge of each other.The program uses something I just call storage, and it is a fork of the main VM that the program gets initialized in. During startup you set everything up, then a bunch of light-weight forks are created off of the main VM, which are henceforth referred to as request VMs. These request VMs are then prepared to handle requests, as they are all pre-initialized. What remains behind is the main VM, which can now be used for something else: Shared mutable storage. So, storage is a special instance of the program. And to make use of storage you can (among other things) call allowed functions to pass data in and out. Since the storage VM shares the constant parts of the program with your request VMs, every static (and static-PIE) link-time symbol has the same address. Because of this, calling a function in storage is as simple as passing an existing function as argument: storage_call(my_function, …) will then call my_function in storage, as long as storage allows it. Or you can use shared memory.

In the example GBC program we call a function in the storage VM from a client request with the gamepad inputs, which then contributes to the next frame. The data we return from storage is an encoded PNG of the current frame.

One thing I noticed is that keys tend to be held for a very long time (compared to a single frame) until they change. So I also made it predict the next frame by assuming the same input. So:

A request VM calls a storage function
Is it time to simulate the next frame? Otherwise deliver old frame
Check if the frame has already been predicted, and if so, return prediction
If the frame wasn’t predicted, generate a new frame and return that
Schedule a predicted frame during downtime

The last point is the important one: We can schedule something to happen in-between storage requests. A function call which storage will schedule outside of request VM accesses. This reduces latency because most of the work is in encoding the PNG. You can find the source code here.

There’s more stuff, of course. Like the ability to hotswap out the GBC program with an updated version while keeping the state. We can make the Gameboy green without restarting the game! But I have to stop now or it will never end.

Languages with some API support

I’ve made APIs for these languages so far, with varying degrees of completeness:

C
C++
Go
JavaScript v8 w/JIT
JavaScript Deno w/JIT
Kotlin
Nelua
Nim
Rust
Zig

Have a look at the program examples repository. C/C++ has the most complete API with languages that understand C headers coming in very close, like Zig. Zig is really up there, but I do think that people who write Zig would appreciate a more Zig-idiomatic API despite it understanding C headers. Right? I asked for an opinion on my Rust API and it got slaughtered. It’s clearly more C-like than I had realized. People have also asked for Python examples, and I will not work on that until after loading dynamically linked executables is supported. It’s coming, though!

I will add that if you feel the API for a language is lacking, make an issue and let me know. My time isn’t infinite, but I’ll do my best.

Otherwise, there’s example programs for many things. There’s WebP and AVIF transcoders. Zstandard and gzip compressors. The usual things.

The Deno JavaScript run-time

I invited Laurence Rowe to write this bit. He does an excellent job running realistic benchmarks, comparing them to existing solutions.

With most web UIs now written in JavaScript, server rendering is necessary to provide the best experience for initial page loads. Ideally such UI code would be run with per-request isolation to avoid the possibility of leaking data between requests — all too easy if a variable declaration is placed at the wrong level — but current options are prohibitively slow.

Existing approaches for per-request isolation use V8 isolates or process forking but these incur several milliseconds of additional latency. WebAssembly can provide microsecond latency for per-request isolation but is incompatible with JIT, so slower execution outweighs the lower startup latency when executing more complicated JS such as rendering a page with React.

Seeing TinyKVM’s performance numbers was really exciting, it clearly works extremely well for many types of programs so how would it fare with JavaScript? V8 alone lacks much of the web platform support necessary to easily run much real world code. Deno provides a mature, fully featured JS runtime built on top of V8 which is largely compatible with the web platform and Node.

The current proof of concept implementation achieves per-request isolation with ~0.4ms of additional latency when rendering a complex page with React taken from a real site. Median rendering times on my system are 0.57ms under stock Deno without per-request isolation and with GC running on background threads. 0.72ms running single threaded under TinyKVM without per-request isolation and 0.95ms with per-request isolation.

This appears to be the fastest option currently for running substantial JavaScript programs under per-request isolation by an order of magnitude! Executables with smaller memory footprints have even lower latencies on TinyKVM and we will continue looking for further optimizations.

The current proof of concept builds a static executable that runs deno_runtime in single threaded mode and provides just enough integration with the Varnish TinyKVM API to run some benchmarks.

Support was added to TinyKVM to run Rust’s crt-static support used when statically linking with glibc, avoiding the complexity of building Deno under musl. And a new wait_for_requests_paused API was added to allow JS to synchronously call into the Varnish API as a host function, avoiding an extra trip around the event loop that the callback API would require.

It’s not yet clear how Deno under TinyKVM will look longer term. It’s possible it could become a Deno extension if Deno adds an option to run single threaded and TinyKVM gains the ability to run dynamically linked executables. But there are many V8 build options to explore which might make a custom build worthwhile.

A gzip benchmark and hugepages in Deno

I’m happy that per-request isolation is performing well. I like doing benchmarks now and then, and I also recently did a quick gzip benchmark against the internal zlib in Varnish:

TinyKVM Blog Image 2

libdeflate is indeed as fast as they claim

Zlib-ng provides at least a 33% performance boost over the zlib embedded in Varnish for these relatively small payloads. libdeflate claims to be significantly faster than alternatives, and really delivers on that promise.

Otherwise, benchmarking can be puzzling sometimes:

  
    
    
    
    
      $ ./wrk -c4 -t4 http://127.0.0.1:8080/ -H "Host: deno"
Running 10s test @ http://127.0.0.1:8080/
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   103.33us   14.54us   1.16ms   78.07%
    Req/Sec     9.68k   385.52    10.65k    69.31%
  389339 requests in 10.10s, 56.07MB read
Requests/sec:  38549.93
Transfer/sec:      5.55MB

$ ./wrk -c4 -t4 http://127.0.0.1:8080/ -H "Host: deno"
Running 10s test @ http://127.0.0.1:8080/
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    91.17us   11.30us 848.00us   77.06%
    Req/Sec    10.97k   435.97    12.22k    71.29%
  440963 requests in 10.10s, 63.50MB read
Requests/sec:  43663.15
Transfer/sec:      6.29MB
    

This is a benchmark of per-request isolation of a small Deno JS program first without hugepages and then with them enabled. Not for the whole main memory, and not for all of the working memory in request VMs. Just enough pages each to cover read-only segments in the main VM and the hot path in request VMs. One could speculate that hugepages reduces the number of IPIs required per request in addition to the reduced pagetable walking. Either way, it’s a 12% performance boost from a run-time setting, as the program is unmodified. I think that’s a source of my continued bewilderment when doing all this: We’re gaining a lot of performance without recompiling programs just by enabling a setting in TinyKVM.

Fun fact: Laurence’s Deno JS UI renderer.js is at 252k lines of JavaScript!?

Conclusion

We can see that TinyKVM provides high performance sandboxing, not just in raw compute but also in per-request isolation:

  
    
    
    
    
      $ ./wrk -c1 -t1 http://127.0.0.1:8080/ -H "Host: test.com"
Running 10s test @ http://127.0.0.1:8080/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.16us    1.43us 202.00us   96.19%
    Req/Sec    69.41k     3.26k   72.51k    90.10%
  697892 requests in 10.10s, 90.52MB read
Requests/sec:  69100.43
Transfer/sec:      8.96MB
    

It can perform per-request isolation of a tiny program in on average 14us in a HTTP benchmark on my machine. That’s 14 micros end-to-end!

The TinyKVM compute framework in Varnish aims to trivialize data processing. Being directly embedded in Varnish it provides access to the cache and the ability to directly cache data. I hope that some of what the framework can do has been made clear.

It’s weight off my shoulders to know that there isn’t anything fundamentally badly scaling in TinyKVM. It’s also weird to think about that I made a whole new reset mechanism because of the Deno runtime. But I will say that I’m glad that I did. The new reset mechanism proves quite a bit faster for certain programs! Is it really a magnitude faster than other solutions for per-request isolation!?

Authored by Laurence Rowe and Alf-André Walla. Blog originally posted here.

TinyKVM in Varnish and some Deno

A program example

Shared mutable storage

Languages with some API support

The Deno JavaScript run-time

A gzip benchmark and hugepages in Deno

Conclusion

Intel® Converged Edge Media Platform & Varnish Enterprise

TinyKVM: The Fastest Sandbox

Varnish and Observability (Part 1: The Basics)

TinyKVM in Varnish and some Deno

A program example

Shared mutable storage

Languages with some API support

The Deno JavaScript run-time

A gzip benchmark and hugepages in Deno

Conclusion

You may also like this

Intel® Converged Edge Media Platform & Varnish Enterprise

TinyKVM: The Fastest Sandbox

Varnish and Observability (Part 1: The Basics)

SUBSCRIBE TO OUR BLOG

SEARCH OUR BLOG