In my previous blog I explained how a virtual machine (VM) manages memory behind the back of the operating system. Let’s face it, it’s all based on a lie perpetrated by the hypervisor. Every time the operating system is on the verge of discovering the lie, it is knocked unconscious by the hypervisor who then fixes up the OS’s version of reality. There is a scene in the movie Barnyard that graphically illustrates this principle: The farmer is about to discover that farm animals are intelligent when he’s kicked in the head by the mule.
To implement that mechanism, some help from the hardware is needed. The idea is that the execution of certain instructions or access to certain resources will cause a hardware trap that can be processed by the hypervisor. It’s the same principle that is used by the OS to create virtual reality for user processes.
Rings of Protection
The x86 chip is full of vestigial organs, relics of some long forgotten trends in operating system design. One of them is the segmented memory architecture, another is the four rings of protection. Three of the rings, 1, 2 and 3, correspond to user mode, while the OS kernel lives in ring 0. In practice, operating systems use just two rings, usually rings 3 and 0, to separate user processes from the kernel and from each other.
The hardware makes this illusion possible by making some processor instructions available only in ring 0. An attempt to execute them in user mode will result in a fault. Protected instructions include those that modify control registers, in particular the page directory register CR3 and the Interrupt Descriptor Table Register, IDTR. (There are also sensitive instructions that can only be executed with the appropriate I/O privilege level — those deal with I/O and interrupt flags.) The OS, having exclusive access to privileged instructions can set up page tables and interrupt handlers — including the page-fault handler. With those, it can hide its own data structures (including said page tables and the IDT) and its code from user processes.
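To make the ring check concrete, here is a minimal sketch in Python of a toy processor that refuses privileged operations outside ring 0. The class ToyCpu, the exception name, and the operation names mov_to_cr3 and lidt are made up for this illustration; they only loosely mirror the real instructions, and the real fault is a general protection fault delivered through the IDT rather than a Python exception.

```python
class GeneralProtectionFault(Exception):
    """Stands in for the fault the hardware raises (hypothetical name)."""


# Simplified stand-ins for instructions that only ring 0 may execute.
PRIVILEGED = {"mov_to_cr3", "lidt"}


class ToyCpu:
    def __init__(self):
        self.cpl = 0    # current privilege level: 0 = kernel, 3 = user
        self.cr3 = 0    # page directory base
        self.idtr = 0   # interrupt descriptor table base

    def execute(self, op, operand=0):
        # The privilege check happens before the instruction has any effect.
        if op in PRIVILEGED and self.cpl != 0:
            raise GeneralProtectionFault(f"{op} attempted at CPL {self.cpl}")
        if op == "mov_to_cr3":
            self.cr3 = operand
        elif op == "lidt":
            self.idtr = operand
        elif op != "nop":
            raise ValueError(f"unknown operation: {op}")
```

With cpl set to 0 the call execute("mov_to_cr3", 0x1000) succeeds; with cpl set to 3 the same call raises the fault and leaves cr3 untouched, which is exactly the hook the OS relies on.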
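The OS switches from kernel mode to user mode by executing a return instruction, such as iret, that loads a code segment with privilege level 3. (The lowest bit of CR0 is something else: it switches the processor from real mode to protected mode.) User code has no way of simply raising its own privilege level, so switching back to the kernel can only be done through a controlled entry point: a fault, an interrupt, or a system-call instruction.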
Hardware Virtualization
With all these rings of protection you’d think that virtualization of the x86 should be a piece of cake. Just put the hypervisor in ring 0, the OS in ring 1, and user processes in ring 3. Unfortunately this doesn’t work, and for many years the x86 was considered non-virtualizable. I’ll describe later how VMware was able to make this work using some pretty amazing tricks. Fortunately, around 2005, both Intel, with its VT-x, and AMD, with its AMD-V, introduced hardware support for virtualization. The idea was to further split ring 0 into two layers, host mode and guest mode, with the hypervisor running in host mode and the OS in guest mode. (The two modes are also called root and non-root.)
In this model, the hypervisor has complete control over hardware and memory. Before transitioning into guest mode to give control to the OS, it prepares a special Virtual Machine Control Block (VMCB), which will set the state of the guest. Besides storing copies of guest system registers, the VMCB also includes control bits that specify conditions under which the guest will trap (exit) into the hypervisor. For instance, the hypervisor usually wants to intercept reads or writes to control registers, specific exceptions, I/O operations, etc. It may also instruct the processor to use nested page tables. Once the VMCB is ready, the hypervisor executes vmrun and lets the OS take control. The OS runs in ring 0 (guest mode). (VMCB and vmrun are AMD’s names; Intel’s counterparts are the VMCS and vmlaunch/vmresume.)
When any of the exit conditions that were specified in the VMCB are met, the processor exits into the hypervisor (host mode), storing the current state of the processor and the reason for the exit back into the VMCB.
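The run/exit cycle can be sketched as a loop. What follows is a toy model in Python, not the real VMCB layout: the field names, the helper toy_vmrun, and the guest "instructions" (write_cr3, out, add) are all invented for the illustration, and the guest is just a list of (operation, operand) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVmcb:
    intercepts: set = field(default_factory=set)  # events that force an exit
    guest_cr3: int = 0       # saved guest register state
    guest_rip: int = 0       # index of the next guest instruction
    exit_reason: str = ""    # filled in by the "processor" on exit
    exit_operand: int = 0


def toy_vmrun(vmcb, guest_code):
    """Run the guest until it hits an intercepted operation.

    Returns True on an exit into the hypervisor, False when the guest
    runs off the end of its code.
    """
    while vmcb.guest_rip < len(guest_code):
        op, operand = guest_code[vmcb.guest_rip]
        if op in vmcb.intercepts:
            # Record why we exited; the instruction itself has not run.
            vmcb.exit_reason, vmcb.exit_operand = op, operand
            return True
        if op == "write_cr3":
            vmcb.guest_cr3 = operand  # not intercepted: takes effect directly
        vmcb.guest_rip += 1
    return False


def run_guest(guest_code, intercepts):
    """Hypervisor loop: enter the guest, handle each exit, resume."""
    vmcb = ToyVmcb(intercepts=set(intercepts))
    exits = []
    while toy_vmrun(vmcb, guest_code):
        exits.append((vmcb.exit_reason, vmcb.exit_operand))
        if vmcb.exit_reason == "write_cr3":
            # Emulate on the guest's behalf. A real hypervisor would set up
            # shadow or nested page tables here instead of copying the value.
            vmcb.guest_cr3 = vmcb.exit_operand
        vmcb.guest_rip += 1  # step past the emulated instruction
    return vmcb, exits
```

Running run_guest on a guest that writes CR3 and then does port I/O, with both operations intercepted, produces one exit per intercepted operation; with an empty intercept set the guest runs straight through. The guest code is the same in both cases, which is the transparency the scheme is after.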
The beauty of this system is that the hypervisor’s intervention is (almost) totally transparent to the OS. In fact a thin hypervisor was used by Joanna Rutkowska in a successful attack on Windows Vista that stirred a lot of controversy at the 2006 Black Hat conference in Las Vegas. Borrowing from The Matrix, Rutkowska called her rootkit Blue Pill. I’ll talk more about thin hypervisors in my next post.
Software Virtualization
Before there was hardware support for virtualization in the x86 architecture, VMware created a tour de force software implementation of VM. The idea was to use the protection ring system of the x86 and run the operating system in ring 1 (or sometimes 3) instead of ring 0. Since ring 1 doesn’t have kernel privileges, most of the protected things the OS tries to do would cause faults that could be vectored into the hypervisor.
Unfortunately there are some kernel instructions that, instead of faulting in user mode, fail quietly. The notorious example is the popf instruction, which loads processor flags from the stack. The x86 flag register is a kitchen sink of bits, some of them used in arithmetic (sign flag, carry flag, etc.), others controlling the system (like the I/O Privilege Level, IOPL). Only kernel code should be able to modify IOPL. However, when popf is executed in user mode, it doesn’t trap; the IOPL bits in the popped value are quietly ignored. Obviously this wreaks havoc on the operation of the OS when it’s run in user mode.
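Here is a small Python model of that behavior, a sketch that covers only three flags and ignores reserved bits and the other system flags. The function name toy_popf is invented; the bit positions (carry flag in bit 0, interrupt flag in bit 9, IOPL in bits 12 and 13) and the two protection rules are those of protected mode.

```python
CF = 1 << 0            # carry flag, an ordinary arithmetic flag
IF = 1 << 9            # interrupt enable flag
IOPL_MASK = 0b11 << 12 # two-bit I/O privilege level


def toy_popf(eflags, popped, cpl):
    """Return the flags after popf loads `popped` at privilege level `cpl`.

    The function never raises: protected bits silently keep their old
    value, which is the whole problem for a deprivileged kernel.
    """
    iopl = (eflags & IOPL_MASK) >> 12
    protected = 0
    if cpl > 0:
        protected |= IOPL_MASK  # only ring 0 may change IOPL
    if cpl > iopl:
        protected |= IF         # changing IF requires CPL <= IOPL
    return (eflags & protected) | (popped & ~protected)
```

Suppose the kernel starts with interrupts enabled and IOPL 0, and pops a value that clears IF and sets IOPL to 3. In ring 0 it gets what it asked for. In ring 1 only the carry flag changes: interrupts stay enabled and IOPL stays 0, and nothing tells either the kernel or the hypervisor that the request was dropped.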
The solution is to modify the OS code, replacing all occurrences of popf with hypervisor calls. That’s what VMware did, sort of. Since they didn’t have the source code to all operating systems, they had to do the translation in binary and on the fly. Instead of executing the OS binaries directly, they redirected the stream of instructions to their binary translator, which methodically scanned it for the likes of popf and produced a new stream of instructions with appropriate traps, which was then sent to the processor. Of course, once you get on the path of binary translation, you have to worry about things like the targets of all jumps and branches, which have to be adjusted on the fly. You have to divide code into basic blocks and then patch all the jumps and so on.
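To see why jump targets become a problem, consider this deliberately naive Python sketch. It is not how VMware’s translator works: it operates on a whole list of symbolic (opcode, operand) pairs at once rather than on machine code one basic block at a time, and the replacement sequence (save_guest_flags followed by hypercall) is invented. It does show the bookkeeping, though: once a one-instruction popf expands into two instructions, everything after it moves and every jump has to be remapped.

```python
SENSITIVE = {"popf"}       # instructions that must not run natively
JUMPS = ("jmp", "jz")      # opcodes whose operand is an instruction index


def translate(code):
    """Translate a list of (opcode, operand) pairs.

    Sensitive instructions are expanded into a two-instruction sequence
    that calls the hypervisor, and jump targets are remapped to the
    positions of their original destinations in the new stream.
    """
    new_index = []  # old instruction index -> index in translated stream
    out = []
    for op, operand in code:
        new_index.append(len(out))
        if op in SENSITIVE:
            out.append(("save_guest_flags", None))
            out.append(("hypercall", op))
        else:
            out.append((op, operand))
    new_index.append(len(out))  # allow a jump to one past the end
    return [
        (op, new_index[operand]) if op in JUMPS else (op, operand)
        for op, operand in out
    ]
```

In a six-instruction program with popf at positions 0 and 3, a jz that used to target index 4 must target index 6 after translation, while a jmp to index 0 is unaffected. A real translator also has to cope with indirect jumps, whose targets are not known until run time.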
The amazing thing is that all this was done with a very small slowdown. In fact, when VMware started replacing this system with the newly available VT-x and AMD-V, the binary translation initially performed better than hardware virtualization (this has changed since).
Future Directions
I’ve been describing machine virtualization as a natural extension of process virtualization. But why stop at just two levels? Is it possible to virtualize a virtual machine? Not only is it possible, but in some cases it’s highly desirable. Take the case of Windows 7 and its emulation of Windows XP that uses Virtual PC. It should be possible to run this combo inside a virtual box alongside, say, Linux.
There is also a trend to use specialized thin hypervisors for debugging (e.g., Corensic’s Jinx) or security (McAfee’s DeepSAFE). Such programs should be able to work together. Although there are techniques for implementing nested virtualization on the x86, there is still no agreement on the protocol that would allow different hypervisors to cooperate.
In the next installment I’ll talk about thin hypervisors.
Bibliography
- Gerald J. Popek, Robert P. Goldberg, Formal Requirements for Virtualizable Third Generation Architectures. A set of formal requirements that make virtualization possible.
- James Smith, Ravi Nair, The Architecture of Virtual Machines
- Ole Agesen, Alex Garthwaite, Jeffrey Sheldon, Pratap Subrahmanyam, The Evolution of an x86 Virtual Machine Monitor
- Keith Adams, Ole Agesen, A Comparison of Software and Hardware Techniques for x86 Virtualization
- Paul Barham et al., Xen and the Art of Virtualization
- Muli Ben-Yehuda et al., The Turtles Project: Design and Implementation of Nested Virtualization
December 12, 2011 at 3:45 pm
Very interesting read. Looking forward to the next installment.
December 12, 2011 at 10:01 pm
Interesting. I’ve always had a hunch that some kind of translation was going on, but most of my knowledge is limited to attempting to run a hackintosh a long ways back.. My knowledge of machine language and architecture has improved since then, and your explanation of the rings and hypervisor were very helpful and insightful.
Please do tell us more, I’ll be subscribing after I post this.
December 13, 2011 at 10:39 pm
Awesomeness….
Simple description of how things work… thanks a lot.
It is nice to know that the concepts appear to match common sense… easy to understand (of course the devil is in the details 🙂), as you mentioned about how VMware handled it in their software… truly amazing.
December 15, 2011 at 12:37 pm
Very interesting read. Thank you for sharing.
Are you aware of any big differences between, say, VMware Player (I think it is called that) and VirtualBox? Last time I compared the two (2009) there were significant differences.
July 31, 2012 at 8:00 am
Hey, you have presented the subject in a very wonderful way. Could you please give another link for your ‘previous blog’. I would like to read it to clear up some concepts
July 31, 2012 at 4:16 pm
The first blog was Virtual Machines: Memory and the second was Virtualizing Virtual Memory. Enjoy!
August 4, 2012 at 10:40 am
Thanks a ton!
May 29, 2013 at 1:24 am
Do you mind if I quote a couple of your articles or blog posts as long as I provide credit and sources
returning to your site: https://corensic.wordpress.com/2011/12/12/virtual-machines-the-traps/.
I will also be sure to give you the proper anchor-text hyperlink
using your webpage title: Virtual Machines: The Traps | Corensic.
Please be sure to let me know if this is acceptable with you.
Thank you
May 29, 2013 at 12:04 pm
@chinook: Absolutely! Feel free to quote these posts.
October 7, 2015 at 6:53 pm
I have a question. Normally the kernel runs in ring 0, but under a hypervisor it runs in ring 1. Who maps ring 0 to ring 1 while the kernel is running, both in the shadow page table case and in the EPT/NPT case? In the case of binary translation, I understood that the binary code is searched for privileged instructions, which are modified, and the kernel is then run in guest mode or ring 1.
Also, I understand that under a hypervisor the kernel and applications both run as a process in user mode. If a user-mode page fault occurs, how does a kernel running in user mode handle it, that is, generate the guest physical address? I am not getting these concepts. Could you explain?