Writing a virtual machine hypervisor requires a lot of effort and is comparable in complexity to writing an operating system.

A standalone OS must support a lot of functionality and a large number of hardware configurations. However, it’s not really that hard to write a stripped down kernel. Imagine that you only care about the memory subsystem and a subset of interrupts, but don’t have to run more than one process at a time, and you don’t have to deal with I/O devices. I bet you could write such a system in a few weeks (and spend a few years debugging it;-).

Likewise, writing a full-blown virtual machine hypervisor that can multiplex several OSs and interact with various hardware configurations requires a lot of effort. In particular, a hypervisor must be able to virtualize a large variety of devices. But a stripped down hypervisor that deals with one OS at a time, and doesn’t monitor any devices is within reach of a small dedicated team of systems programmers (such as our team at Corensic). And there are plenty of applications for such a thin hypervisor, applications which have nothing to do with cramming multiple operating systems into one box.

A thin hypervisor can transparently hook into many low level activities, whether to monitor or to emulate them. Because a thin hypervisor is much smaller than the OS, it’s also easier to verify its code and incorporate it in the trusted computing base (TCB). McAfee’s DeepSAFE technology relies on this property to protect the system from security threats. Corensic Jinx uses a thin hypervisor to monitor patterns of memory access to detect concurrency bugs. I’m sure there will be many more uses for thin hypervisors once the word gets out that they are feasible to implement.

I will concentrate on the architecture of the Jinx thin hypervisor, since I know it better than others, and because many implementation ideas are common to most thin hypervisors. In my previous blog I described the basics of virtualization so please read it if you’re not familiar with some of the terms I use here.

A full-blown “thick” hypervisor is really an operating system and, like any OS, is loaded during bootstrap. Being the first system to boot has its advantages: A thick hypervisor can start lying to the guest OS from the very beginning. When it loads the guest OS, it can cheat about the type of processor, the number of cores, the size of physical memory, the presence and capabilities of I/O devices, etc. It’s quite common for a thick hypervisor to present to the OS a virtual machine with very limited capabilities. This way it can provide virtual drivers for a standard subset of devices rather than support all the latest and greatest.

In contrast, a thin hypervisor is inserted after the operating system has fully booted. So the OS knows very well what hardware it’s running on, and the opportunities for cheating are much more limited.

Loading the Thin Hypervisor

How does the thin hypervisor sneak under the operating system? The OS always starts in ring 0 but, when virtualization is enabled in the BIOS, the operating system starts in the host mode of ring 0 (as opposed to the guest mode — see my previous blog). The OS has the highest privilege you can get. So the question is not how to shimmy the hypervisor under the OS but rather how to dislodge the OS from host mode.

This can only be done by code that runs inside the kernel with the highest privilege level. And the only way to install application code in the kernel is in the form of a device driver. This is why, when installing a thin hypervisor like Jinx, the user needs to elevate his or her security level. In the case of Windows, the driver needs to be signed by Microsoft.

The driver sets up the VMCB (Virtual Machine Control Block — see my previous blog) and prepares handlers for particular events that are to be trapped by the hypervisor. It sets the instruction pointer in the VMCB to point to a specific offset in the driver code and executes VMRUN. The processor switches to guest mode and jumps to the specified offset. From that point on, the OS, including the driver, runs in guest mode until it traps to the hypervisor and executes one of the handlers in host mode.
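For concreteness, here is roughly what that handoff could look like from the driver’s side (an AMD-V-flavored sketch; the VMCB fields, helper functions, and the vmrun wrapper are placeholders for vendor-specific definitions, not a real kernel API):

#include <cstdint>

// Sketch only: demoting the running OS to guest mode from a kernel driver.
// Every name below is illustrative; real code follows the CPU vendor's manual.
struct Vmcb;                                  // 4kB control block defined by the hardware

Vmcb*    allocate_vmcb();                     // page-aligned, physically contiguous memory
uint64_t physical_address(void* p);           // kernel virtual-to-physical translation
void     capture_current_cpu_state(Vmcb*);    // copy segments, control registers, flags
void     set_intercepts(Vmcb*);               // e.g., trap CPUID and NMI into the hypervisor
void     set_guest_rip(Vmcb*, void* entry);   // where the guest resumes execution
void     vmrun(uint64_t vmcb_physical);       // wrapper around the VMRUN instruction
extern "C" void guest_entry();                // label in the driver code

void load_thin_hypervisor()
{
    Vmcb* vmcb = allocate_vmcb();
    capture_current_cpu_state(vmcb);      // the guest continues exactly where the host was
    set_intercepts(vmcb);                 // which events exit back into host mode
    set_guest_rip(vmcb, (void*)&guest_entry);
    vmrun(physical_address(vmcb));        // from here on, the OS runs in guest mode
}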

The opposite procedure — the unloading of the hypervisor — is also easy. A thin hypervisor may jump in and out from under the OS at any time.

If inserting a hypervisor is so simple, you might be wondering if the same procedure couldn’t also be used to install an almost undetectable virus on somebody’s machine. Indeed, as I already mentioned in my previous blog, Joanna Rutkowska, a security researcher from Poland, shocked the community by demonstrating such an attack called Blue Pill on Windows Vista. Part of the attack was installing an unsigned device driver which, as it turned out, was possible on Vista RC1 (it was done by forcing existing drivers to be paged out and then modifying their code directly in the page file). This security hole has since been closed, but still, the specter of a thin hypervisor attack led most hardware manufacturers to disable virtualization in the BIOS by default. This is why you don’t see malware using this kind of attack (knock on wood!). But, as a side effect of this defensive behavior, the first-time installation of Jinx might require enabling virtualization in the BIOS. You have to do it during the boot sequence — a minor nuisance.

Communicating with the Hypervisor

The way you talk to the hypervisor, either from user mode or from the driver, is a bit tricky. We decided to trap the CPUID instruction into the hypervisor. It’s a rarely used unprotected instruction and it’s not performance sensitive (that is, nobody calls it in a tight loop a million times). CPUID returns various pieces of info about the processor, depending on the value stored in EAX. Only a few values are meaningful, the rest are up for grabs. Our hypervisor sets a trap for CPUID and inspects EAX for a special value. For all other values, it forwards CPUID to the physical processor.
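For illustration, a user-mode or driver-side probe for the hypervisor could look something like this (a sketch using GCC/Clang inline assembly; the magic leaf and the signature value are invented here, since the real protocol is private to the hypervisor):

#include <cstdint>

// Sketch of a CPUID-based hypercall. The leaf and signature values are made up;
// a real thin hypervisor defines its own private protocol.
constexpr uint32_t kMagicLeaf = 0x4A4E5800;   // hypothetical "is the hypervisor there?" query
constexpr uint32_t kSignature = 0xC0DE0001;   // hypothetical answer from the hypervisor

bool hypervisor_present()
{
    uint32_t eax = kMagicLeaf, ebx = 0, ecx = 0, edx = 0;
    asm volatile("cpuid"
                 : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx));
    // If the hypervisor is loaded, it intercepts the trap, recognizes the magic
    // leaf, and fills the registers itself; otherwise the physical CPU answers
    // and the signature won't match.
    return ebx == kSignature;
}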

The communication back from the hypervisor is easy since the hypervisor has complete control over virtual memory and can map any data into user’s or driver’s address space.

Passive and Active Mode

Jinx uses the hypervisor for stochastic testing of a running program. It lets the program run without interference most of the time and periodically, for short intervals, switches to active mode. This way Jinx doesn’t noticeably slow down the running program, which is important for realistic testing.

The Jinx hypervisor starts in passive mode, in which it does nothing other than service CPUID traps (actually it is also sensitive to nonmaskable interrupts, NMIs, which I’ll discuss later). But when the hypervisor gets the prearranged signal, it switches to active mode.

In active mode, Jinx is mostly interested in tracking memory accesses. This requires trapping page faults, CR3 accesses, and attempts at I/O. As I mentioned before, thin hypervisors don’t provide their own device drivers, so they let the OS deal with I/O using the OS’s own drivers. Jinx doesn’t even do that. It intercepts all attempts at communication and simply stalls the corresponding thread until the end of its short testing period.

Since Jinx does stochastic sampling, it frequently switches between passive and active modes. During normal operation — sampling a program for concurrency bugs — the entry to active mode is triggered by the timer, which sends a non-maskable interrupt (NMI) to the OS. This is why our hypervisor traps NMIs when it’s in passive mode.

Managing Memory

Since a thin hypervisor doesn’t have to multiplex operating systems, it can use a much simpler scheme for memory management. For instance, it may make host physical addresses the same as guest physical addresses. Or, like the Corensic hypervisor, it may use a copy-on-write scheme. The pages that are only read use identity mapping from guest physical to host physical. But as soon as a page is written to, the Jinx hypervisor (inside the page-fault handler) makes a copy of it and maps its guest physical address to the host physical address of the copy.
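In pseudocode, the write-fault path of such a copy-on-write scheme could look roughly like this (a simplified sketch with invented types and helpers, not the actual Jinx code):

#include <cstdint>
#include <unordered_map>

// Simplified sketch of copy-on-write for guest physical pages during an active
// (virtualized) interval. All types and helpers are illustrative placeholders.
typedef uint64_t Gpa;   // guest physical address of a page
typedef uint64_t Hpa;   // host physical address of a page

Hpa  allocate_host_page();                               // from the hypervisor's own pool
void copy_page(Hpa dst, Hpa src);                        // copy one 4kB page
void map_gpa_to_hpa(Gpa gpa, Hpa hpa, bool writable);    // update the guest-to-host mapping

std::unordered_map<Gpa, Hpa> cow_copies;   // private copies made during this interval

void on_guest_write_fault(Gpa page)
{
    if (cow_copies.find(page) == cow_copies.end())
    {
        // First write to this page in the current interval: clone it.
        Hpa copy = allocate_host_page();
        copy_page(copy, page);             // the page was identity-mapped until now
        cow_copies[page] = copy;

        // Redirect the guest to its private, writable copy; the original page
        // stays untouched, so the interval can later be re-run or discarded.
        map_gpa_to_hpa(page, copy, true);
    }
}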

This is how Jinx is able to virtualize the execution of the program during active periods. As long as there is no communication with external devices, Jinx simulations don’t modify any of the original memory pages. They can be restarted and deterministically re-run using the logs of memory accesses that are kept by Jinx. This is the technology that enables Jinx’s algorithms to detect data races (for more information see Pete Godman’s post).

Devices

During a virtual run, as soon as the debugged program tries to access an external device, Jinx has to do something. Otherwise the program’s virtual state would become observable and — if you like analogies with Quantum Mechanics — its wave function would collapse. To prevent that, Jinx blocks the thread that is trying to do I/O until the end of the simulation. When the program is finally run in the real world (as opposed to virtual execution), it will perform the stalled I/O.

In order to implement this behavior, a thin hypervisor must intercept all I/O attempts. Those involve the in and out instructions, interrupts, and the hardest one — memory mapped I/O.

Conclusion

I like to think of a hypervisor as an extension of hardware. It lets you observe the state and modify the behavior of CPUs, memory, and devices. This can be done through hardware, for example using ICE (In-Circuit Emulator), but the software approach is more flexible.

A hypervisor offers a unique ability to look at the executing program and the OS from the hardware perspective. You can not only observe their interactions but also modify the behavior of both software and hardware on the fly. You can even build an interpreter of the instruction stream directed at the processors. Or, like Jinx, you can virtualize and replay the executions of small intervals of a program.

Acknowledgments

I’d like to thank the members of the Corensic team for explaining to me the workings of the Jinx hypervisor and for feedback on this blog.

Bibliography

  1. Joanna Rutkowska, Subverting Vista Kernel for Fun and Profit
  2. Heradon Douglas, Thin Hypervisor-Based Security Architectures for Embedded Platforms
  3. Takahiro Shinagawa et al., BitVisor: A Thin Hypervisor for Enforcing I/O Device Security
  4. David E. Lowell, Yasushi Saito, and Eileen J. Samberg, Devirtualizable Virtual Machines Enabling General, Single-Node, Online Maintenance


In my previous blog I explained how a virtual machine (VM) manages memory behind the back of the operating system. Let’s face it, it’s all based on a lie perpetrated by the hypervisor. Every time the operating system is on the verge of discovering the lie, it is knocked unconscious by the hypervisor who then fixes up the OS’s version of reality. There is a scene in the movie Barnyard that graphically illustrates this principle: The farmer is about to discover that farm animals are intelligent when he’s kicked in the head by the mule.

To implement that mechanism, some help from the hardware is needed. The idea is that the execution of certain instructions or access to certain resources will cause a hardware trap that can be processed by the hypervisor. It’s the same principle that is used by the OS to create virtual reality for user processes.

Rings of Protection

The x86 chip is full of vestigial organs, relics of some long forgotten trends in operating system design. One of them is the segmented memory architecture, another is the four rings of protection. Three of the rings, 1, 2 and 3, correspond to user mode, while the OS kernel lives in ring 0. In practice, operating systems use just two rings, usually rings 3 and 0, to separate user processes from the kernel and from each other.

The hardware makes this illusion possible by making some processor instructions available only in ring 0. An attempt to execute them in user mode will result in a fault. Protected instructions include those that modify control registers, in particular the page directory register CR3 and the Interrupt Descriptor Table Register, IDTR. (There are also sensitive instructions that can only be executed with the appropriate I/O privilege level — those deal with I/O and interrupt flags.) The OS, having exclusive access to privileged instructions, can set up page tables and interrupt handlers — including the page-fault handler. With those, it can hide its own data structures (including said page tables and the IDT) and its code from user processes.

The OS drops from kernel mode into user mode by executing a special instruction (such as iret) that lowers the privilege level. Since privileged instructions can’t be executed in user mode, switching back to the kernel can only happen through a fault, an interrupt, or a system call.

Hardware Virtualization

With all these rings of protection you’d think that virtualization of the x86 should be a piece of cake. Just put the hypervisor in ring 0, the OS in ring 1, and user processes in ring 3. Unfortunately this doesn’t work, and for many years the x86 was considered non-virtualizable. I’ll describe later how VMware was able to make this work using some pretty amazing tricks. Fortunately, in around 2005, both Intel, with its VT-x, and AMD, with its AMD-V, introduced hardware support for virtualization. The idea was to further split ring 0 into two layers, host mode and guest mode — the hypervisor running in the host mode and the OS in the guest mode. (The two modes are also called root and non-root.)

In this model, the hypervisor has complete control over hardware and memory. Before transitioning into guest mode to give control to the OS, it prepares a special Virtual Machine Control Block (VMCB), which will set the state of the guest. Besides storing copies of guest system registers, the VMCB also includes control bits that specify conditions under which the host will trap (exit) into the hypervisor. For instance, the hypervisor usually wants to intercept: reads or writes to control registers, specific exceptions, I/O operations, etc. It may also instruct the processor to use nested page tables. Once the VMCB is ready, the hypervisor executes vmrun and lets the OS take control. The OS runs in ring 0 (guest mode).

When any of the exit conditions that were specified in the VMCB are met, the processor exits into the hypervisor (host mode), storing the current state of the processor and the reason for the exit back into the VMCB.
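Conceptually, the hypervisor is then just a loop: run the guest, inspect the exit code stored in the VMCB, handle it, and resume. Here is a sketch (the structure layout and handlers are placeholders; the exit-code values follow AMD’s numbering but are shown only for flavor):

#include <cstdint>

// Sketch of the hypervisor's exit-dispatch loop. The Vmcb layout, the vmrun()
// wrapper, and the handlers are placeholders for vendor-specific definitions.
struct Vmcb
{
    uint64_t exit_code;    // why the guest exited
    uint64_t guest_rip;    // where the guest will resume
    // ... many more fields defined by the hardware ...
};

const uint64_t EXIT_CPUID      = 0x72;
const uint64_t EXIT_NMI        = 0x61;
const uint64_t EXIT_PAGE_FAULT = 0x4E;   // exception vector 14, offset by 0x40

void vmrun(Vmcb* vmcb);          // enter guest mode, return on the next exit
void handle_cpuid(Vmcb*);
void handle_nmi(Vmcb*);
void handle_page_fault(Vmcb*);
void reflect_to_guest(Vmcb*);    // let the OS deal with everything else

void hypervisor_loop(Vmcb* vmcb)
{
    for (;;)
    {
        vmrun(vmcb);                        // guest runs until an intercepted event
        switch (vmcb->exit_code)            // the hardware stored the reason here
        {
            case EXIT_CPUID:      handle_cpuid(vmcb);      break;
            case EXIT_NMI:        handle_nmi(vmcb);        break;
            case EXIT_PAGE_FAULT: handle_page_fault(vmcb); break;
            default:              reflect_to_guest(vmcb);  break;
        }
    }
}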

The beauty of this system is that the hypervisor’s intervention is (almost) totally transparent to the OS. In fact a thin hypervisor was used by Joanna Rutkowska in a successful attack on Windows Vista that stirred a lot of controversy at the 2006 Black Hat conference in Las Vegas. Borrowing from The Matrix, Rutkowska called her rootkit Blue Pill. I’ll talk more about thin hypervisors in my next post.

Software Virtualization

Before there was hardware support for virtualization in the x86 architecture, VMware created a tour de force software implementation of a VM. The idea was to use the protection ring system of the x86 and run the operating system in ring 1 (or sometimes 3) instead of ring 0. Since ring 1 doesn’t have kernel privileges, most of the protected things the OS tries to do would cause faults that could be vectored into the hypervisor.

Unfortunately there are some kernel instructions that, instead of faulting in user mode, fail quietly. The notorious example is the popf instruction that loads processor flags from the stack. The x86 flag register is a kitchen sink of bits, some of them used in arithmetic (sign flag, carry flag, etc.), others controlling the system (like I/O Privilege Level, IOPL). Only kernel code should be able to modify IOPL. However, when popf is called from user mode, it doesn’t trap — the IOPL bits are quietly ignored. Obviously this wreaks havoc on the operation of the OS when it’s run in user mode.

The solution is to modify the OS code, replacing all occurrences of popf with hypervisor calls. That’s what VMware did — sort of. Since they didn’t have the source code to all operating systems, they had to do the translation in binary and on the fly. Instead of executing the OS binaries directly, they redirected the stream of instructions to their binary translator, which methodically scanned it for the likes of popf, and produced a new stream of instructions with appropriate traps, which was then sent to the processor. Of course, once you get on the path of binary translation, you have to worry about things like the targets of all jumps and branches having to be adjusted on the fly. You have to divide code into basic blocks and then patch all the jumps and so on.

The amazing thing is that all this was done with very small slowdown. In fact when VMware started replacing this system with the newly available VT-x and AMD-V, the binary translation initially performed better than hardware virtualization (this has changed since).

Future Directions

I’ve been describing machine virtualization as a natural extension of process virtualization. But why stop at just two levels? Is it possible to virtualize a virtual machine? Not only is it possible, but in some cases it’s highly desirable. Take the case of Windows 7 and its emulation of Windows XP that uses Virtual PC. It should be possible to run this combo inside a virtual box alongside, say, Linux.

There is also a trend to use specialized thin hypervisors for debugging (e.g., Corensic’s Jinx) or security (McAfee’s DeepSAFE). Such programs should be able to work together. Although there are techniques for implementing nested virtualization on the x86, there is still no agreement on the protocol that would allow different hypervisors to cooperate.

In the next installment I’ll talk about thin hypervisors.

Bibliography

  1. Gerald J. Popek, Robert P. Goldberg, Formal Requirements for Virtualizable Third Generation Architectures. A set of formal requirements that make virtualization possible.
  2. James Smith, Ravi Nair, The Architecture of Virtual Machines
  3. Ole Agesen, Alex Garthwaite, Jeffrey Sheldon, Pratap Subrahmanyam, The Evolution of an x86 Virtual Machine Monitor
  4. Keith Adams, Ole Agesen, A Comparison of Software and Hardware Techniques for x86 Virtualization
  5. Paul Barham et al., Xen and the Art of Virtualization.
  6. Muli Ben-Yehuda et al., The Turtles Project: Design and Implementation of Nested Virtualization

Just like the OS creates virtual reality for user processes, the hypervisor creates virtual reality for operating systems. In my last blog I talked about virtual memory. I’ll continue by describing how the hypervisor further virtualizes the memory seen by the operating system.

The operating system is responsible for allocating physical pages and mapping them to virtual addresses. When the operating system boots, it asks the hardware how much physical memory there is. If your machine has 4GB of RAM, the OS will know that it can allocate a million or so 4kB pages starting at offset 0. If you want to run more than one OS on the same hardware, the hypervisor will have to provide each of the guest OSs with the illusion of a more or less contiguous block of physical memory starting at address 0, because that’s what all OSs expect.

The simplest way to do that is to introduce one more level of virtualization — one more level of mapping between what the guest OS considers a physical address (it’s called guest physical address) and the actual physical address (which is called host physical address). In other words, a virtual address is first translated to the guest physical address, and then to the host physical address — the latter translation controlled by the hypervisor.

On the left, two processes are running in the operating system OS 1, which allocates guest physical addresses from its own pool. On the right, a second operating system, OS 2, allocates addresses from another pool. The hypervisor maps guest physical addresses from both pools onto one pool of host physical addresses -- the actual DRAM.
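Conceptually, the full translation is a composition of two mappings, one maintained by the guest OS and one by the hypervisor. Here is a sketch (real page tables are multi-level trees walked by hardware, not flat maps):

#include <cstdint>
#include <map>

// Conceptual sketch: two mappings composed. Real page tables are multi-level
// radix trees walked by hardware, not flat std::map lookups.
typedef uint64_t GuestVirtual;
typedef uint64_t GuestPhysical;
typedef uint64_t HostPhysical;

std::map<GuestVirtual,  GuestPhysical> guest_page_table;  // maintained by the guest OS
std::map<GuestPhysical, HostPhysical>  host_page_table;   // maintained by the hypervisor

HostPhysical translate(GuestVirtual va)
{
    // First level: what the guest OS believes is the final, physical address.
    GuestPhysical gpa = guest_page_table.at(va);
    // Second level: the hypervisor quietly remaps it onto actual DRAM.
    return host_page_table.at(gpa);
}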

There are several ways of implementing this scheme.

Nested Page Tables

Address translation on the x86 is implemented in hardware, with page tables stored in regular memory that is directly accessible to the OS kernel. It’s not easy to hook an additional level of indirection into this machinery. In recent years Intel and AMD started producing chips that provide hardware support for the second layer of virtualization. AMD calls its technology nested page tables (NPT), and Intel calls theirs extended page tables (EPT). Unfortunately those two differ in some important details — e.g., EPT doesn’t support accessed/dirty bits.

The usual page table walk, which happens when the mapping is not present in the TLB, goes through the guest page tables and continues into the host page tables, until the physical address is found. This address then goes into the TLB to speed up future lookups. This happens in hardware without any intervention from the hypervisor.

The hypervisor has to intervene only when a page fault happens. It’s a two stage process: First the operating system processes the fault. It allocates guest physical memory from its (fake) pool and stores it in the guest page table. Remember, the OS knows nothing about the second level of indirection — it’s just your regular Linux or Windows. When the user instruction that caused the fault is restarted, address translation progresses half way, and faults again when it can’t find the guest-to-host mapping. This time the fault is vectored into the hypervisor, which allocates actual physical memory and patches the rest of the page tables.

Older-generation chips don’t support NPT, so a different set of software tricks is used.

Shadow Page Tables

In the absence of nested page table support, there is another cheat that’s used by virtual machines. The idea is to create a second copy of page tables called shadow page tables that map virtual addresses directly to host physical addresses (actual DRAM), and let the processor use them for address translation (so the CR3 register points to those, rather than to guest, or primary page tables).

Primary and shadow page tables

The structure of shadow page tables that are kept by the hypervisor reflects the structure of primary page tables that are kept by the operating system. There are several approaches to keeping those two structures in sync.

The naive approach would be to let the OS freely modify its page tables, e.g., in response to a page fault. The hypervisor would only intercept the privileged instruction INVLPG that the OS uses to invalidate a TLB entry. At that point the hypervisor would make a corresponding change to its shadow page tables, substituting the guest physical address with the corresponding host physical address (possibly allocating a physical page from its own pool). Unfortunately, this doesn’t work because some operating systems (I won’t be pointing fingers) occasionally forget to invalidate TLB entries.

The fallback approach is for the hypervisor to write-protect all guest page tables. Whenever the OS tries to modify them, a page fault occurs — the so called tracing fault — which is vectored into the hypervisor. That gives the hypervisor the opportunity to immediately update its shadow page tables. In this scheme the two structures always mirror each other (modulo physical addresses) but at the cost of a large number of page faults.

A common optimization is to allow shadow page tables to be lazily updated. This is possible if not all primary page tables are write protected. If an entry is present in the primary page table but is missing from the shadow page table, a page fault will occur when the hardware walks it for the first time. This is a so called hidden page fault and it’s serviced by the hypervisor, which at that point creates an entry in its shadow page tables based on the entry in the primary page tables.

Since the hypervisor must be involved in the processing of some of the page faults, it must install its own page handler. That makes all faults, real and hidden, vector into the hypervisor. A real page fault happens when the entry is missing (or is otherwise protected) in the primary page tables. Such faults are reflected back to the OS — the OS knows what to do with them. Faults that are caused by the OS manipulating page tables, or by the absence of an entry in the shadow page tables are serviced by the hypervisor.

An additional complication is related to the A/D (accessed and dirty) bits stored in page tables. Normally, whenever a page of memory is accessed, the processor sets the A bit in the corresponding PTE (Page Table Entry); and when the page is written, it also sets the D bit. These bits are used by the OS in its page replacement policy. For instance, if a page hasn’t been dirtied, it doesn’t have to be written back to disk when it’s unmapped. In a virtualized system, the processor will of course mark the shadow PTEs, and those bits have to somehow propagate to the primary page tables for use by the OS.

This can be accomplished by always starting host physical pages in read-only mode. The hypervisor allocates a host physical page but marks it read-only in the shadow PTE, which is used by the processor for address translation. Reading from this page proceeds normally, but as soon as the processor tries to write to it, there is a protection fault. The hypervisor intercepts this fault: It checks the guest page tables to make sure the page was supposed to be writable; if so, it sets the dirty bit in the guest PTE, and removes the write protection from the shadow PTE. After that, write access to this page may proceed without further faults.
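The essence of this write-protection trick fits in a few lines of pseudocode (a sketch; the PTE type and the helper functions are invented):

#include <cstdint>

// Sketch of propagating the dirty bit from shadow PTEs to guest PTEs.
// The Pte type and the helper functions are illustrative placeholders.
struct Pte { uint64_t bits; };

bool guest_pte_allows_write(const Pte& guest_pte);
void set_guest_dirty_bit(Pte& guest_pte);
void clear_shadow_write_protection(Pte& shadow_pte);
void reflect_fault_to_guest_os();

// Called on a write-protection fault against a page that the hypervisor
// deliberately mapped read-only in the shadow tables.
void on_shadow_write_protection_fault(Pte& guest_pte, Pte& shadow_pte)
{
    if (!guest_pte_allows_write(guest_pte))
    {
        // The page really is read-only as far as the OS is concerned:
        // this is a genuine fault, let the OS handle it.
        reflect_fault_to_guest_os();
        return;
    }
    // The OS thinks the page is writable; record the write on its behalf
    // and let future writes proceed without faulting.
    set_guest_dirty_bit(guest_pte);
    clear_shadow_write_protection(shadow_pte);
}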

The hypervisor keeps separate shadow page tables for each process. When a context switch occurs, the OS tries to modify the CR3 register to point to the primary page directory of the new process. CR3 modification is a protected instruction; it traps into the hypervisor, giving it the opportunity to switch the shadow tables and point CR3 at them. The hypervisor might also choose to discard shadow page tables upon a context switch, and later refill them on demand using hidden page faults.

Guest/Host Transitions

All these hypervisor tricks rely on the ability to trap certain operations performed by the operating system. The virtualization of virtual memory requires the trapping of page faults, TLB flushes, and CR3 register modifications. I’ll describe various methods to achieve this in my next blog post.

I’d like to thank David Dunn and Pete Godman for sharing their expertise with me.

Bibliography

  1. Johan De Gelas, Hardware Virtualization: the Nuts and Bolts
  2. Intel® Virtualization Technology
  3. AMD-V™ Nested Paging

There have been times when a computer could only run one application at a time. Nowadays even smartphones try to run multiple programs in parallel. The next logical step is to run multiple operating systems in parallel on the same machine. A modern PC has enough resources and speed to do it efficiently. Welcome to the era of virtual machines.

VM technology has entered the mainstream in recent years, with Intel and AMD building support for it into their popular chips. If you have a computer that is less than five years old, there is a good chance that it supports virtualization in hardware. Even if you don’t feel the urge to run Linux and Windows at the same time on your laptop, virtualization has many other interesting uses that are only now being explored. Since Corensic’s own product, Jinx, uses virtualization for debugging concurrent programs, in this series of blog posts we will share our knowledge of VM technology with you.

The idea of a virtual machine is not hard to grasp. There is a perfect analogy for it: the relationship between processes and the operating system.

Imagine that you are a process: You live in virtual reality in which you have exclusive access to an almost arbitrary number of processors on which to run your threads. You also have access to a large swath of contiguous memory, limited only by your addressing capabilities (32- or 64-bit). And for all you know you are the only process in the universe.

The same way the OS can fool processes, a hypervisor can fool the OS. A hypervisor creates the illusion of a machine on which the OS operates–the virtual machine. And it does it in a very similar way.

In this series I will discuss different areas of virtualization, first by describing how they are implemented for processes by the OS, and then for the OS by the hypervisor. In this installment, I’ll talk about how the operating system virtualizes memory.

Virtual Memory

There are three major aspects of virtual memory:

  1. Providing a process with the illusion of large contiguous memory space
  2. Allowing multiple processes to run in parallel
  3. Hiding the operating system code and data from processes and hiding processes from each other

The illusion is created by the operating system with the help of hardware.

It’s the OS that creates a process that executes a program. It provides the process with means to access its 32-bit or 64-bit address space. It does it by mapping each address used by the program (a.k.a., virtual address) into an address in computer memory (a.k.a., physical address). So when the program dereferences a pointer, this pointer is translated into an actual address in memory.

Address Translation

Address translation is done using a table. Conceptually it’s a table that maps every virtual address to its corresponding physical address. In practice, the address space is divided into larger chunks called pages and the page table maps virtual pages into physical pages. It would still be a lot of data to map the whole virtual address space, so page tables are organized in a tree structure and only some branches of it are present in memory at any given time.

Let’s see how this works on a 32-bit x86. There is a top-level page directory table with entries that point to lower-level page tables. When it comes to translating a virtual address, the x86 takes the top 10 bits of the address and uses them as an offset into the directory table. How does it find the directory table? It stores the pointer to it in one of the special control registers, called CR3. An entry in the directory table contains a pointer to the corresponding page table. The x86 uses the next 10 bits of the virtual address to offset into that table. Finally, the entry in the page table contains the address of the physical page. The lowest 12 bits of the virtual address are then used as the offset into that page.

Virtual address decoding
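The bit slicing itself is easy to show in code (a sketch of the classic non-PAE 32-bit layout; a real walk also reads the table entries from memory and checks their present bits):

#include <cstdint>

// Sketch of decoding a 32-bit virtual address (classic non-PAE x86 paging).
struct DecodedAddress
{
    uint32_t directory_index;  // top 10 bits: entry in the page directory
    uint32_t table_index;      // next 10 bits: entry in the page table
    uint32_t page_offset;      // low 12 bits: offset within the 4kB page
};

DecodedAddress decode(uint32_t virtual_address)
{
    DecodedAddress d;
    d.directory_index = (virtual_address >> 22) & 0x3FF;
    d.table_index     = (virtual_address >> 12) & 0x3FF;
    d.page_offset     =  virtual_address        & 0xFFF;
    return d;
}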

This kind of table walking is relatively expensive, and it involves multiple memory accesses, so a caching mechanism is used to speed it up. The x86 has a special Translation Lookaside Buffer, TLB, in which most recently used page translations are stored. If the upper 20 bits of a given address are present in the TLB, address translation is very fast. In the case of a TLB miss, the processor does the table walk and stores the new translation in the TLB. On the x86, table walking is implemented in hardware which, in retrospect, turned out to be not such a great idea. We’ll talk about it more in the context of virtualization of virtual memory.

Page Fault

At this point I would like to say that virtual address space is much larger than the amount of physical memory normally available on a PC, but that’s no longer true. I have 4GB of memory in my laptop, which is the maximum amount addressable by a 32-bit processor. But then again, my laptop runs on a 64-bit processor, which can address exabytes of memory. In any case, virtual memory was designed to allow the addressing of more memory than is physically available to a particular process.

One way to look at it is that physical memory is nothing but a large cache for the main storage medium: the hard disk. Indeed, when a process asks the OS for memory, the OS reserves a chunk of disk space inside the swap file and creates some dummy entries for it in the page table. When the program first tries to access this memory, the processor goes through the usual hoops: it looks it up in the TLB–nothing there–then it starts walking the tables and finds the dummy entry. At this point it gives up and faults. Now it is up to the operating system to process the page fault. It has previously registered its handler with the processor and now this handler is executed.

This is what the page fault handler does:

  1. It allocates a page of physical memory. If there aren’t any free pages, it swaps an existing one to disk and invalidates its page entry.
  2. If the page had been previously swapped out, the handler reads the data from the page file to the newly allocated physical page
  3. It updates the page table and the TLB
  4. It restarts the faulting instruction

Notice that all this is totally transparent to the running process. This kind of trap and emulate mechanism forms the basis of all virtualization.

Isolation

Virtual memory is also used to run multiple processes on the same machine. Processes are defined in terms of their address spaces: unlike threads, processes can’t access each other’s memory. Every process has its own virtual address space with its own private page table. One process cannot peek at another process’s memory simply because there is no way it can address it.

At the hardware level, when the operating system assigns an x86 processor (or core) to a given process, it sets its CR3 register to point to that process’s page directory. In a multitasking environment this assignment changes with every context switch.

Setting one control register is not an expensive operation, but on an x86 it must be accompanied by a TLB flush, which results in new table walks to fill it again; and that might be quite expensive. This is why both Intel and AMD have added Address Space Identifiers (ASIDs) to TLB entries in their newer processors. TLBs with ASIDs don’t have to be flushed; the processor simply ignores entries with the wrong ASID.

One of the important aspects of virtual memory is that it provides isolation between processes: one process can’t modify, or even see, another process’s memory. This not only prevents one buggy program from trashing the whole system (a common occurrence in DOS or the first versions of Windows), but also improves the overall security of the system.

Levels of Privilege

Full isolation would be impossible if a user program could directly manipulate page tables. This is something that only the trusted operating system should be able to do. For instance, a user program should not be able to set the value of CR3. This prohibition has to be enforced at the processor level, which designates certain instructions as privileged. Such instructions must be accessible from the kernel of the operating system, but cause a fault when a user process tries to execute them. The x86 must therefore be able to distinguish between these two modes of operation: kernel (a.k.a. supervisor) and user. For historical reasons, the x86 offers four levels of protection called rings: ring 0 corresponding to kernel mode and rings 1, 2, and 3 to user mode. In practice, only rings 0 and 3 are used, except for some implementations of virtualization.

When the operating system boots, it starts in kernel mode, so it can set up the page tables and do other kernel-only things. When it starts an application, it creates a ring 3 process for it, complete with separate page tables. The application may drop back into the kernel only through a trap, for instance when a page fault occurs. A trap saves the state of the processor, switches the processor to kernel mode, and executes the handler that was registered by the OS.

In the next installment I’ll explain how the hypervisor fools the operating system by virtualizing virtual memory.

If you expected std::async to be just syntactic sugar over thread creation, you can stop reading right now, because that’s what it is. If you expected more, read on.

Don’t get me wrong, std::async combines several useful concurrency concepts into a nice package: It provides a std::future for the return value, and hides the std::promise side of the future. It also provides options to run a task synchronously. (See the Appendix for a short refresher.)

But tasks have a slightly different connotation in parallel programming: they are the basic blocks of task-based parallelism. And C++11 tasks fall short on that account.

Task-Based Parallelism

Tasks are an answer to performance and scalability problems associated with threads. Operating system threads are rather heavy-weight; it takes time and system resources to create a thread. If you have an algorithm that naturally decomposes into a large number of independent computations, a.k.a. tasks, you’ll probably get your best performance not by creating a separate thread for each task, but by adjusting the number of threads depending on the amount of parallelism available on your particular system, e.g., the number of cores and their current loads. This can be done by hand, using thread pools and load balancing algorithms; or by using task-based systems.

In a task-based system, the programmer specifies what can be run in parallel but lets the system decide how much parallelism to actually use. The programmer splits the algorithm into tasks and the runtime assigns them to threads — often many tasks to a thread.

There are many implementations of task-based parallelism with varying language support. There’s the Cilk language, which pioneered this approach; there’s built-in support in Haskell, F#, and Scala; and there are several C++ libraries, like Microsoft PPL or Intel TBB.

Unlike thread creation, task creation is supposed to be relatively inexpensive, letting the programmer exploit fine-grained parallelism and take advantage of multicore speedups.

At the center of task-based systems are work-stealing queues. When a thread creates tasks, they are initially queued on the processor (core) that runs the thread. But if there are other idle processors, they will try to steal tasks from other queues. The stolen tasks are then run in parallel.
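A toy version of such a queue might look like this (mutex-based for clarity; production systems like TBB and Cilk use carefully crafted lock-free deques):

#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Toy work-stealing queue: the owning worker pushes and pops at the back (LIFO,
// good for cache locality), idle workers steal from the front (FIFO, they get
// the oldest piece of work). A mutex keeps this sketch short.
class WorkStealingQueue
{
public:
    typedef std::function<void()> Task;

    void push(Task t)                       // called by the owning thread
    {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(t));
    }
    bool try_pop(Task& out)                 // called by the owning thread
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }
    bool try_steal(Task& out)               // called by other, idle workers
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<Task> tasks_;
};

Each worker thread owns one of these queues: it pushes newly created tasks onto its own queue and pops them from the back; only when that queue runs dry does it loop over the other workers’ queues calling try_steal.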

Notice that tasks must be able to migrate between threads. What’s more, efficient use of OS threads requires that tasks that are blocked, for instance waiting for I/O or waiting for other tasks to finish, should be taken off their threads, so that other tasks may reuse them.

C++11 Tasks

My expectation was that C++11 “tasks” that are created using std::async should be abstracted from threads, just as they are in task-based parallelism. When I started preparing a video tutorial about tasks in C++, I wrote a simple program to demonstrate it. I created async tasks using the default launch policy and waited for them to complete. Each task slept for one second and then printed its thread ID.

#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() 
{
    std::cout << "Main thread id: " << std::this_thread::get_id() 
        << std::endl;
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 20; ++i)
    {
        auto fut = std::async([]
        {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            std::cout << std::this_thread::get_id() << " ";
        });
        futures.push_back(std::move(fut));
    }
    std::for_each(futures.begin(), futures.end(), [](std::future<void> & fut)
    {
        fut.wait();
    });
    std::cout << std::endl;
}

The results were surprising. The first six tasks executed in parallel, each in its own thread. But the rest of the tasks executed in the main thread one after another, separated by 1 second intervals. (Note: this behavior was fixed in v 1.7 of Just::Thread — read on).

The output of the task test

This approach to parallelism obviously doesn’t scale very well.

Then I wrote another program that lists directories recursively, creating a separate async task for each subdirectory. But this time I explicitly requested launch policy launch::async, which guarantees that each task will start in a new thread. This program worked up to a point, but when I tried to list my whole disk, it failed by exhausting Windows’ limited thread creation capability. Again, this approach doesn’t scale very well.

What was even worse, when the program didn’t fail, it performed better with launch::deferred policy, which forces all tasks to be executed serially, than with the launch::async policy. That’s because thread creation in Windows is so expensive that it can easily nullify performance gains of parallelism (although Windows 7 supports user-level threads, which might bypass these problems).
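For reference, the structure of such a program is simple. This is a sketch of the same idea rather than the original code; it uses the later C++17 filesystem library for brevity and omits error handling:

#include <cstddef>
#include <filesystem>
#include <future>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;

// Sketch: one async task per subdirectory, each started with launch::async,
// i.e. on its own OS thread. Deep directory trees exhaust thread creation.
std::size_t count_entries(fs::path const& dir)
{
    std::size_t count = 0;
    std::vector<std::future<std::size_t>> children;
    for (auto const& entry : fs::directory_iterator(dir))
    {
        ++count;
        if (entry.is_directory())
        {
            // A new OS thread for every subdirectory.
            children.push_back(std::async(std::launch::async,
                                          count_entries, entry.path()));
        }
    }
    for (auto& fut : children)
        count += fut.get();      // internal blocking: waiting on child tasks
    return count;
}

int main()
{
    std::cout << count_entries(".") << " entries\n";
}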

My first reaction was to blame Anthony Williams for badly implementing the Just::Thread library I was using. When he assured me that it was Standard compliant, I turned to Herb Sutter and Hans Boehm for confirmation and they sided with Anthony. It turns out that there are serious problems that prevented C++11 from standardizing task-based concurrency.

The Problems

The foundation of task-based parallelism is the ability for tasks to share threads and to migrate between threads. This sharing and migration must be transparent.

The requirements for the default-launch tasks are the following:

  • The runtime can run such a task either asynchronously or synchronously
  • When it’s run synchronously, it should be run in the context of the parent thread
  • When it’s run asynchronously, it should behave as if it were run on a separate thread

Strictly speaking, a task could always call this_thread::get_id() and fool any attempt at thread sharing or migration by discovering the ID of the current thread. In general, the namespace std::this_thread, which also contains sleep functions, is thread-bound.

But let’s suppose that we only require that asynchronous tasks behave as if they were run on separate threads, except when they call functions in the this_thread namespace. There are still several problems.

Thread-Local Variables

C++11 introduced a new thread_local storage qualifier. A thread-local variable is separately initialized and destroyed in every thread. It must not survive thread termination. This requirement complicates thread sharing.

In our exchange, Hans Boehm clarified the termination requirement for tasks: Thread-local variables must be fully destroyed before the owning thread returns from calling get or wait on a future produced by the corresponding std::async; or before the destructor of that future returns, whichever comes first.

This actually leaves some wiggle room: A thread could be reused if the runtime guarantees that thread-local variables of terminating tasks are correctly destroyed. Unfortunately, this might be impossible if the programmer calls OS-specific API, like Windows’ TlsAlloc. Anthony also pointed out that it’s not clear how to deal with DLL_THREAD_DETACH handlers in DLLs, when switching to task granularity.

Locks

There’s another aspect of C++11 concurrency that is tied to threads — locking. The std::mutex object is thread aware. It requires that unlock is called from the same thread as lock. Why should this be a problem?

I haven’t mentioned yet that task migration might be necessary in the middle of execution, but that is what most task-based systems do. It’s an optimization in the case when you’re dealing with blocking tasks.

There are two major blocking scenarios: external and internal. External blocking happens when a task calls an OS function (directly or indirectly) that may block, for instance waiting for I/O. My directory listing program did a lot of that. Internal blocking, which is potentially easier to intercept, happens when tasks are blocked on futures. My program did a lot of that too, when waiting for the results of tasks that were listing subdirectories of the current directory.

A blocked task doesn’t use processor resources, but it does occupy a thread. That thread could be reused to run another task. But that requires a clean way of taking a task off a thread and then restoring its state once the call unblocks. Now, if the task takes a lock on a mutex before blocking, it cannot be migrated to another thread. The unlocking wouldn’t work from another thread.

Herb Sutter observed that, if we tried to restore the task to its original thread, we might get into a spurious deadlock, when the original thread is occupied by another task waiting for the same mutex.

The other problem with locks is in the use of a recursive_mutex. A thread may lock such a mutex multiple times before calling unlock (also multiple times). But if a second thread tries to lock a mutex that’s owned by the first thread, it will block. As long as tasks run in separate threads, this works. But if they share the same thread, they may both successfully acquire the same mutex and cause data corruption.

Imagine the following scenario. Task A wants to modify a shared data structure and takes a lock on its recursive mutex. It then blocks on some OS call (probably not a good idea in general, but it may happen). The task gets taken off of the current thread, and task B starts executing. It takes a lock on the same mutex — successfully, as it is executing in the same thread, and reads or modifies a data structure that was in the middle of being modified by task A. A disaster unfolds.

Notice that this is not a problem if tasks are run serially in the same thread, as it happens with the launch::deferred policy, because each task runs to completion before allowing another task to run.

Finally, such migration of running tasks would also wreak havoc with thread-local variables.

Possible Solutions

Optimizing the Default Launch Policy

The easiest part was to change the implementation of the default policy, to defer the decision whether to run a given task asynchronously or as deferred. Anthony was quick to notice this, and almost immediately released a fix — version 1.7 of Just::Thread.

The idea is simple: you schedule N tasks asynchronously — N being some runtime number dependent on the number of available cores — and put the rest on a queue. When any of the queued tasks is forced (by the call to get or wait on its future), it is executed synchronously in the context of the forcing thread — as if the launch::deferred policy were used. Otherwise, as soon as one of the asynchronous tasks finishes, the next task from the queue is scheduled to run asynchronously. Here’s the output of the same test program after the changes in Just::Thread:

The output of the test with the new library


This time each task ran in a separate thread, but because of the default launch policy, they ran in small batches that could effectively exploit the parallelism of a multicore machine. Still, without thread reuse, the runtime had to create 22 OS threads. The hope is that the operating system caches thread resources so that the creation of the second batch of threads is substantially cheaper than the first one.

(I suggest running this test when evaluating any implementation of a task library.)
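A drastically simplified version of this scheduling idea can be expressed as a wrapper that picks the launch policy based on how many asynchronous tasks are already in flight (a sketch only; unlike the real Just::Thread fix, it never promotes a queued task to its own thread later):

#include <algorithm>
#include <atomic>
#include <future>
#include <thread>
#include <utility>

// Sketch of a "bounded" async: run up to a hardware-dependent number of tasks
// on real threads, fall back to deferred execution for the rest.
static std::atomic<int> tasks_in_flight(0);

template <class F>
auto limited_async(F f) -> std::future<decltype(f())>
{
    const int max_async =
        static_cast<int>(std::max(2u, std::thread::hardware_concurrency()));

    if (tasks_in_flight.fetch_add(1) < max_async)
    {
        return std::async(std::launch::async, [f]()
        {
            // Release the slot when the task finishes, even if f() throws.
            struct Guard { ~Guard() { --tasks_in_flight; } } guard;
            return f();
        });
    }
    --tasks_in_flight;   // not occupying a thread after all
    return std::async(std::launch::deferred, std::move(f));
}

With such a wrapper, the test program above could call limited_async in place of std::async and would never have more than a handful of threads running at a time.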

Thread Reuse

The next step in improving task performance would be to use a thread pool and reuse threads instead of creating them from scratch. Because of the problem with thread-local variables, it might be impossible to implement thread reuse without some help from the language runtime. The task library would need hooks into every creation of a thread_local variable, so it can destroy them at task exit.

That still leaves the problem of tasks calling APIs like TlsAlloc directly. An attractive option (for library writers) would be to ignore the problem — after all, the language provides a portable way of dealing with thread-local storage.

Task Migration

We would like to be able to remove a blocked task from a thread in order to run another task on it. This is not easy because of thread-locals and locks.

The problem with thread_local variables is that they should really be task-local. Or at least they should behave “as if” they were task-local. So when two tasks are sharing the same thread, there has to be some mechanism for “context switching” between them. The context would have to include the state of all thread-local variables.

Migrating a task that is holding a lock could only be done if locks were task-bound rather than thread-bound. Interestingly, there is a provision in the Standard for this kind of behavior. The definition of Lockable in (30.2.5) talks about “execution agents” that could be threads, but could also be something else. This comment is of particular interest:

[ Note: Implementations or users may introduce other kinds of agents such as processes or thread-pool tasks. —end note ]

However, the Standard Library mutex is bound to threads, not tasks. The intention of (30.2.5) is that, if you create your own separate task library with your own task-local variables and mutexes, you will still be able to use the standard utilities such as lock_guard or condition variables. But the implementation of std::async tasks must work with thread_local and std::mutex.

Deadlocks

Here’s a potential scenario where two tasks could deadlock if their threads are reused while they are blocked:

  1. Task A runs on thread T1, takes the mutex M1, and makes a blocking call
  2. The runtime takes A off T1 (saves its state, etc.) and puts it in a Blocked queue
  3. Task B starts executing on the same thread, T1, and tries to take M1, which is locked by A
  4. In order to unlock M1, task A would have to run on T1 — the same thread the lock was taken on — but T1 is now occupied by B, and A can’t make progress

The only way to resolve this deadlock is to take B off the thread. So that’s what a task migration system must do — guarantee that any blocked task is taken off its thread.

In general, any spurious deadlock would involve a bunch of blocked tasks. If all of them are blocked on locks, this is an actual deadlock which would happen anyway. If there is at least one task that can make progress when its blocking call returns, it can always be assigned back to its thread, either because the task running on it completes, or because it’s blocked and will be taken off of it.

Of course if we allow lock migration, as in associating locks with tasks rather than threads, the problem disappears on its own.

Conclusion

What I learned from this exercise was that std::async with default launch policy can be made usable. However its strong association with threads makes it virtually impossible to implement full-blown task-based parallelism. A task-based system could be implemented as a library but it would have to come with severe restrictions on the use of thread_local variables and standard mutexes. Such a system would have to implement its own version of task-local variables and mutexes.

I’m grateful to Anthony Williams, Hans Boehm, Herb Sutter, and Artur Laksberg for enlightening discussions.

Appendix: async Refresher

Here’s some typical code that uses std::async to start a task:

auto ftr = std::async([](bool flag)->bool
{
    if (flag)
        throw std::runtime_error("Hi!");
    return flag;
}, true); // <- pass true to lambda
// do some work in parallel...
try
{
    bool flag = ftr.get(); // may re-throw exception
}
catch(std::exception const & e)
{
    std::cout << e.what() << std::endl;
}

The code calls std::async with a lambda (anonymous function) that takes a Boolean flag and returns a Boolean. The lambda can either throw an exception or return the flag back. The second argument to async (true, in this case) is passed to the lambda when it is executed.

The value or the exception is passed back to the parent code when it calls the get method on the future returned by async. The call to async may create a new thread, or defer the execution of the function until the call to get is made.

The same code may be implemented directly using std::thread, std::promise, and std::future but, among other things, it requires modifications to the thread function (here, to the lambda):

std::promise<bool> prms;
auto th = std::thread([](std::promise<bool> & prms, bool flag)
{
   if (flag)
     prms.set_exception(std::make_exception_ptr(std::runtime_error("Hi!")));
   else
     prms.set_value(flag);
}, std::ref(prms), true);
// do some work
th.join();
auto ftr = prms.get_future();
try
{
   bool flag = ftr.get();
}
catch(std::exception const & e)
{
   std::cout << e.what() << std::endl;
}

By the way, I haven’t been slacking off. The C++11 Concurrency Series of video tutorials is now at the fifth installment (there are links to the previous tutorials as well). I’ve covered threads, fork/join, futures, promises, async tasks, task-based concurrency and much more. Next time I’ll try to gently introduce elements of map/reduce. I’m surprised how much can be done without locks and traditional synchronization (although those will come later in the series).
