In my previous blog I explained how a virtual machine (VM) manages memory behind the back of the operating system. Let’s face it, it’s all based on a lie perpetrated by the hypervisor. Every time the operating system is on the verge of discovering the lie, it is knocked unconscious by the hypervisor who then fixes up the OS’s version of reality. There is a scene in the movie Barnyard that graphically illustrates this principle: The farmer is about to discover that farm animals are intelligent when he’s kicked in the head by the mule.

To implement that mechanism, some help from the hardware is needed. The idea is that the execution of certain instructions or access to certain resources will cause a hardware trap that can be processed by the hypervisor. It’s the same principle that is used by the OS to create virtual reality for user processes.

Rings of Protection

The x86 chip if full of vestigial organs, relics of some long forgotten trends in operating system design. One of them is the segmented memory architecture, another is the four rings of protection. Three of the rings, 1, 2 and 3, correspond to user mode, while the OS kernel lives in ring 0. In practice, operating systems use just two rings, usually rings 3 and 0, to separate user processes from the kernel and from each other.

The hardware makes this illusion possible by making some processor instructions available only in ring 0. An attempt to execute them in user mode will result in a fault. Protected instructions include those that modify control registers, in particular the page directory register CR3 and the Interrupt Descriptor Table Register, IDTR. (There are also sensitive instructions that can only be executed with the appropriate I/O privilege level — those deal with I/O and interrupt flags.) The OS, having exclusive access to privileged instructions can set up page tables and interrupt handlers — including the page-fault handler. With those, it can hide its own data structures (including said page tables and the IDT) and its code from user processes.

The OS switches from kernel mode to user mode by setting the lowest bit in the CR0 register. But since access to control registers doesn’t work in user mode, switching back to kernel can only be done through a fault or an interrupt.

Hardware Virtualization

With all these rings of protection you’d think that virtualization of the x86 should be a piece of cake. Just put the hypervisor in ring 0, the OS in ring 1, and user processes in ring 3. Unfortunately this doesn’t work, and for many years the x86 was considered non-virtualizable. I’ll describe later how VMware was able to make this work using some pretty amazing tricks. Fortunately, in around 2005, both Intel, with its VT-x, and AMD, with its AMD-V, introduced hardware support for virtualization. The idea was to further split ring 0 into two layers, host mode and guest mode — the hypervisor running in the host mode and the OS in the guest mode. (The two modes are also called root and non-root.)

In this model, the hypervisor has complete control over hardware and memory. Before transiting into guest mode to give control to the OS, it prepares a special Virtual Machine Control Block (VMCB), which will set the state of the guest. Besides storing copies of guest system registers, VMCB also includes control bits that specify conditions under which the host will trap (exit) into the hypervisor. For instance, the hypervisor usually wants to intercept: reads or writes to control registers, specific exceptions, I/O operations, etc. It may also instruct the processor to use nested page tables. Once the VMCB is ready, the hypervisor executes vmrun and lets the OS take control. The OS runs in ring 0 (guest mode).

When any of the exit conditions that were specified in the VMCB are met, the processor exits into the hypervisor (host mode), storing the current state of the processor and the reason for the exit back into the VMCB.

The beauty of this system is that the hypervisor’s intervention is (almost) totally transparent to the OS. In fact a thin hypervisor was used by Joanna Rutkowska in a successful attack on Windows Vista that stirred a lot of controversy at the 2006 Black Hat conference in Las Vegas. Borrowing from The Matrix, Rutkowska called her rootkit Blue Pill. I’ll talk more about thin hypervisors in my next post.

Software Virtualization

Before there was hadware support for virtualization in the x86 architecture, VMware created a tour de force software implementation of VM. The idea was to use the protection ring system of the x86 and run the operating system in ring 1 (or sometimes 3) instead of ring 0. Since ring 1 doesn’t have kernel privileges, most of the protected things the OS tries to do would cause faults that could be vectored into the hypervisor.

Unfortunately there are some kernel instructions that, instead of faulting in user mode, fail quietly. The notorious example is the popf instruction that loads processor flags from the stack. The x86 flag register is a kitchen sink of bits, some of them used in arithmetic (sign flag, carry flag, etc.), others controlling the system (like I/O Privilege Level, IOPL). Only kernel code should be able to modify IOPL. However, when popf is called from user mode, it doesn’t trap — the IOPL bits are quietly ignored. Obviously this wreaks havoc on the operation of the OS when it’s run in user mode.

The solution is to modify the OS code, replacing all occurrences of popf with hypervisor calls. That’s what VMWere did — sort of. Since they didn’t have the source code to all operating systems, they had to do the translation in binary and on the fly. Instead of executing the OS binaries directly, they redirected the stream of instructions to their binary translator, which methodically scanned it for the likes of popf, and produced a new stream of instructions with appropriate traps, which was then sent to the processor. Of course, once you get on the path of binary translation, you have to worry about things like the targets of all jumps and branches having to be adjusted on the fly. You have to divide code into basic blocks and then patch all the jumps and so on.

The amazing thing is that all this was done with very small slowdown. In fact when VMware started replacing this system with the newly available VT-x and AMD-V, the binary translation initially performed better than hardware virtualization (this has changed since).

Future Directions

I’ve been describing machine virtualization as a natural extension of process virtualization. But why stop at just two levels? Is it possible to virtualize a virtual machine? Not only is it possible, but in some cases it’s highly desirable. Take the case of Windows 7 and its emulation of Windows XP that uses Virtual PC. It should be possible to run this combo inside a virtual box alongside, say, Linux.

There is also a trend to use specialized thin hypervisors for debugging (e.g., Corensic’s Jinx) or security (McAfee’s DeepSAFE). Such programs should be able to work together. Although there are techniques for implementing nested virtualization on the x86, there is still no agreement on the protocol that would allow different hypervisors to cooperate.

In the next installment I’ll talk about thin hypervisors.

Bibliography

  1. Gerald J. Popek, Robert P. Goldberg, Formal Requirements for Virtualizable Third Generation Architectures. A set of formal requirements that make virtualization possible.
  2. James Smith, Ravi Nair, The Architecture of Virtual Machines
  3. Ole Agesen, Alex Garthwaite, Jeffrey Sheldon, Pratap Subrahmanyam, The Evolution of an x86 Virtual Machine Monitor
  4. Keith Adams, Ole Agesen, A Comparison of Software and Hardware Techniques for x86 Virtualization
  5. Paul Barham et al., Xen and the Art of Virtualization.
  6. Muli Ben-Yehuda et al., The Turtles Project: Design and Implementation of Nested Virtualization
Advertisements