An introduction to a KVM-based single-process sandbox
Hey All. In between working on my PhD, libriscv and an untitled game (it’s too much, I know), I have also been working on a KVM sandbox for single programs: a so-called userspace emulator. I wanted to make the world’s fastest sandbox using hardware virtualization, or at least I had an idea of what I wanted to do.
I wrote a blog post about sandboxing each and every request in Varnish back in 2021, titled Virtual Machines for Multi-tenancy in Varnish. In it, I wrote that I would look into using KVM for sandboxing instead of using a RISC-V emulator. And so… I went ahead and wrote TinyKVM.
So, what is TinyKVM and what does it bring to the table?
TinyKVM executes regular Linux programs with the same results as native execution.
TinyKVM can be used to sandbox regular Linux programs or programs with specialized APIs embedded into your servers.
TinyKVM’s design
In order to explain just what TinyKVM is, I’m going to list the explicit features that are currently implemented and working as intended:
TinyKVM runs static Linux ELF programs. It can also be extended with an API made by you, giving it access to e.g. an outer HTTP server or cache. I’ll also be adding dynamic executable support eventually. ⏳ It currently runs on AMD64 (x86_64), and I will port it to AArch64 (64-bit ARM) at some later point in time.
TinyKVM creates hugepages where possible for guest pages, and can additionally be backed by hugepages on the host. The result is often (if not always) higher performance than a vanilla native program. Just to hammer this in a bit: https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-For-Code found that merely allocating 2MB pages for the execute segment gave a 5% compilation speedup on the LLVM codebase.
I quickly allocated some hugepages and ran STREAM in TinyKVM, and yes, it’s quite a bit faster.
TinyKVM has only ~2us of overhead when calling a function in the guest. This may seem like a lot compared to my RISC-V emulator’s 3ns; however, we are entering another process, and we get to use all of our CPU features.
TinyKVM can halt execution after a given time without any thread or signal setup during the call, something unavailable to regular Linux programs. Without an execution timeout the call overhead drops to 1.2us, as we no longer need a timer.
TinyKVM can be remotely debugged with GDB. The program can be resumed after debugging, and I’ve actually used that to live-debug a request in Varnish and see it complete normally afterwards. Cool stuff, if I may say so.
TinyKVM can fork itself into copies that use copy-on-write, allowing huge workloads like LLMs to share most memory. As an example, 6GB of weights required only 260MB of working memory per instance, making it highly scalable.
TinyKVM forks can reset themselves in record time to a previous state using mechanisms unavailable to regular Linux programs. If security is important, VMs can be made ephemeral by resetting them after every request. This removes any trace of previous requests and rules out entire classes of attacks, since nothing can persist between requests. A TinyKVM instance can also be reset to another VM it was not forked from, at a performance penalty, as the pagetables have to change.
TinyKVM uses only a fraction of the KVM API, which itself is around 42k LOC. The total lines of code covered by TinyKVM at run-time are unknown, but probably less than 40k LOC in total, since no devices or drivers are used. Compare this to e.g. 350k LOC for wasmtime and 165k for Firecracker, both large enough that they should ideally also run under a process jail on top; Firecracker ships with a jailer. For TinyKVM specifically, the KVM codebase is around 7k (base) + 37k (x86) LOC, and TinyKVM itself is around 9k LOC.
TinyKVM creates static pagetables during initialization, in a way that works even for unusual run-times like Go. The only pages that may change afterwards are copy-on-write pages. The pagetables are thus largely static, which is a security benefit, and run-time sanity checks are performed during KVM exits.
A KVM guest has a separate PCID/ASID, similar to a process, so for the purposes of current and future speculation bugs it will likely be immune.
The TinyKVM guest has a tiny kernel which cannot be modified. SMEP and SMAP are enabled, as well as CR0.WP and page protections. The whole kernel uses 7x 4k pages in total. We also enter the guest in usermode and avoid kernel mode except where unavoidable. Did you know that you can handle CPU exceptions in usermode on x86?
System calls
An API toward the host is created by defining and hooking up system calls in the emulator. In the guest they can be invoked either via SYSCALL/SYSRET or directly with an OUT instruction from usermode, the latter having the lowest latency. Either causes a VM exit, and the host inspects the registers to find that a system call is being performed. All in all, the round-trip is a bit under ~1 microsecond, which is quite different from libriscv’s 2ns. But you can design a smaller API with larger inputs and outputs, instead of many smaller calls (which make sense when calls are cheap).
Benchmarks
The overhead of resetting the VM + calling into the VM guest.
For call overhead, we measure the cost of resetting the VM as tail latency, because it can be paid while (for example) an HTTP response is already underway. If you don’t need to reset, simply ignore the red. Reset time scales with working-memory usage.
Memory benchmarks are done to check that nothing is obviously wrong, and nothing is. One of my favorite benchmarks was encoding 1500 AVIF images per second in an HTTP benchmark (AVIF encoding as a service):
AVIFs can be transcoded from JPEGs quite fast with the right methods.
So we can see that 16 x 98.88 ≈ 1582 image transcodings per second. Of course the transcoder settings and image size matter, but I am more interested in the scaling aspect: does doubling concurrency increase overall performance, up to a point? Yes, it does.
Fun fact: Did you know that you can transcode JPEG to AVIF without going through (lossy?) RGB conversion? YUV planes work the same way in both formats!
Fast sandboxing
So, what exactly makes this the world’s fastest sandbox? Well, for one it uses no I/O, no drivers and no virtual devices, so it shakes off a common problem with other KVM solutions: virtualized I/O reduces performance somewhat.
TinyKVM also focuses heavily on hugepages; even when they are not available on the host, the guest still benefits from fewer page walks. When backed by real hugepages, we immediately see large gains. I measured a large CPU-based LLM workload against the same workload running inside TinyKVM, with the same seed and settings, and after correcting for setup time I found that TinyKVM ran at 99.7% of native speed. I don’t know whether that means virtualization has a minimum overhead of 0.3%, or whether it’s just statistical noise over such a long run. Either way, I’m happy the difference is small enough not to matter.
It also completes function calls into the guest (VM calls) fairly fast, meaning we’re not really adding overhead anywhere. We’re really just running on the CPU, which is what we want. So, provided we have zero-copy solutions for accessing data, we’re simply processing it and passing it back in a safe way.
In short, we’re processing data at the speed of the native CPU, which is exactly what we want: avoiding the overheads of classical full-system virtualization and enjoying our new lane without any at all.
Drawbacks
It’s not possible to reduce the vCPU count after increasing it in the KVM API. Because of this, I consider multi-processing better achieved by running more VMs concurrently and using (or abusing) the automatic memory sharing. TinyKVM does have experimental multi-processing support, but since you can’t wind down afterwards, it’s not a great fit for long-running processes like Varnish.
There are multiple ways to work around this, which I won’t go into detail about here, but I’ll mention that you can re-use a VM by resetting it to another, different VM. Costly, but it opens the door to cool solutions like a shared pool of VMs.
Future work
- Intel TDX/AMD SEV support would be nice to have.
- AArch64 port.
- There is a KVM feature, KVM_MEM_READONLY, that locks down memory so that not even kernel mode in the guest can change it. I hope it can become part of locking down the guest even more. A potential drawback is an increased number of memory mappings.
- The user-facing API in TinyKVM needs work to become friendly.
- Move much of the system call emulation I’ve written for a Varnish integration into TinyKVM proper, which paves the way for dynamic linker loading. ld.so uses mmap() with fds to map in the files it is loading. So if you load ld.so into your own emulator, pass your intended program as the first argv argument and its arguments after that, ld.so will load your program for you, provided you have that mmap() implementation. This is what libriscv does!
Conclusion
TinyKVM, perhaps surprisingly, places itself among the smallest serious sandboxing solutions out there, and may also be the fastest. It takes security seriously and tries to avoid complex guest features and kernel mode in general. TinyKVM has a minimal attack surface and no ambition to grow further in complexity, outside of nicer user-facing APIs and ports to other architectures.
Have a look at the code repository for TinyKVM, if you’re interested. Documentation and the user-facing API still need a lot of work, but feel free to build the project and run some simple Linux programs with it. As long as they are static and don’t need file or network access, they might just run out of the box.
Not too long from now we will open-source a VMOD in Varnish that lets you transform data safely using TinyKVM. It works on both open-source Varnish and Varnish Enterprise! If you'd like to learn more, reach out to us today!
Blog originally posted here.