x86


Writing a hypervisor for a virtual machine requires a lot of effort and is comparable in complexity with writing an operating system.

A standalone OS must support a lot of functionality and a large number of hardware configurations. However, it’s not really that hard to write a stripped down kernel. Imagine that you only care about the memory subsystem and a subset of interrupts, but don’t have to run more than one process at a time, and you don’t have to deal with I/O devices. I bet you could write such a system in a few weeks (and spend a few years debugging it;-).

Likewise, writing a full-blown virtual machine hypervisor that can multiplex several OSs and interact with various hardware configurations requires a lot of effort. In particular, a hypervisor must be able to virtualize a large variety of devices. But a stripped-down hypervisor that deals with one OS at a time and doesn't monitor any devices is within reach of a small dedicated team of systems programmers (such as our team at Corensic). And there are plenty of applications for such a thin hypervisor, applications that have nothing to do with cramming multiple operating systems into one box.

A thin hypervisor can transparently hook into many low level activities, whether to monitor or to emulate them. Because a thin hypervisor is much smaller than the OS, it’s also easier to verify its code and incorporate it in the trusted computing base (TCB). McAfee’s DeepSAFE technology relies on this property to protect the system from security threats. Corensic Jinx uses a thin hypervisor to monitor patterns of memory access to detect concurrency bugs. I’m sure there will be many more uses for thin hypervisors once the word gets out that they are feasible to implement.

I will concentrate on the architecture of the Jinx thin hypervisor, since I know it better than others, and because many implementation ideas are common to most thin hypervisors. In my previous blog I described the basics of virtualization so please read it if you’re not familiar with some of the terms I use here.

A full-blown "thick" hypervisor is really an operating system and, like any OS, is loaded during bootstrap. Being the first system to boot has its advantages: A thick hypervisor can start lying to the guest OS from the very beginning. When it loads the guest OS, it can cheat about the type of processor, the number of cores, the size of physical memory, the presence and capabilities of I/O devices, etc. It's quite common for a thick hypervisor to present the OS with a virtual machine of very limited capabilities. This way it can provide virtual drivers for a standard subset of devices rather than support all the latest and greatest.

In contrast, a thin hypervisor is inserted after the operating system has fully booted. So the OS knows very well what hardware it’s running on, and the opportunities for cheating are much more limited.

Loading the Thin Hypervisor

How does the thin hypervisor sneak under the operating system? The OS always starts in ring 0 but, when virtualization is enabled in the BIOS, the operating system starts in the host mode of ring 0 (as opposed to the guest mode — see my previous blog). The OS has the highest privilege you can get. So the question is not how to shimmy the hypervisor under the OS but rather how to dislodge the OS from host mode.

This can only be done by code that runs inside the kernel with the highest privilege level. And the only way to install application code in the kernel is in the form of a device driver. This is why, when installing a thin hypervisor like Jinx, the user needs to elevate his or her security level. In the case of Windows, the driver needs to be signed by Microsoft.

The driver sets up the VMCB (Virtual Machine Control Block — see my previous blog) and prepares handlers for particular events that are to be trapped by the hypervisor. It sets the instruction pointer in the VMCB to point to a specific offset in the driver code and executes VMRUN. The processor switches to guest mode and jumps to the specified offset. From that point on, the OS, including the driver, runs in guest mode until it traps to the hypervisor and executes one of the handlers in host mode.

The opposite procedure — the unloading of the hypervisor — is also easy. A thin hypervisor may jump in and out from under the OS at any time.

If inserting a hypervisor is so simple, you might be wondering if the same procedure couldn't also be used to install an almost undetectable virus on somebody's machine. Indeed, as I already mentioned in my previous blog, Joanna Rutkowska, a security researcher from Poland, shocked the community by demonstrating such an attack, called Blue Pill, on Windows Vista. Part of the attack was installing an unsigned device driver which, as it turned out, was possible on Vista RC1 (it was done by forcing existing drivers to be paged out and then modifying their code directly in the page file). This security hole has since been closed, but still, the specter of a thin hypervisor attack led most hardware manufacturers to disable virtualization in the BIOS by default. This is why you don't see malware using this kind of attack (knock on wood!). But, as a side effect of this defensive behavior, the first-time installation of Jinx might require enabling virtualization in the BIOS. You have to do it during the boot sequence, a minor nuisance.

Communicating with the Hypervisor

The way you talk to the hypervisor, either from user mode or from the driver, is a bit tricky. We decided to trap the CPUID instruction into the hypervisor. It’s a rarely used unprotected instruction and it’s not performance sensitive (that is, nobody calls it in a tight loop a million times). CPUID returns various pieces of info about the processor, depending on the value stored in EAX. Only a few values are meaningful, the rest are up for grabs. Our hypervisor sets a trap for CPUID and inspects EAX for a special value. For all other values, it forwards CPUID to the physical processor.
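
To make this concrete, here is a minimal user-mode sketch of what issuing such a "hypercall" looks like from the guest side. The leaf value below is invented for illustration (hypervisors conventionally put private leaves in the 0x40000000 range); the actual values and the meaning of the returned registers are specific to each hypervisor and are not public in Jinx's case.

#include <cpuid.h>
#include <stdio.h>

#define HYPERCALL_LEAF 0x40001234u   /* invented magic value, not a real protocol */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(HYPERCALL_LEAF, 0, eax, ebx, ecx, edx);
    /* With no hypervisor trapping CPUID, this is just an ordinary CPUID with an
       unknown leaf; a thin hypervisor could return whatever it likes here. */
    printf("eax=%08x ebx=%08x ecx=%08x edx=%08x\n", eax, ebx, ecx, edx);
    return 0;
}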

The communication back from the hypervisor is easy since the hypervisor has complete control over virtual memory and can map any data into user’s or driver’s address space.

Passive and Active Mode

Jinx uses the hypervisor for stochastic testing of a running program. It lets the program run without interference most of the time and periodically, for short intervals, turns to active mode. This way Jinx doesn’t noticeably slow down the running program, which is important for realistic testing.

The Jinx hypervisor starts in passive mode, in which it does nothing other than service CPUID traps (actually it is also sensitive to nonmaskable interrupts, NMIs, which I’ll discuss later). But when the hypervisor gets the prearranged signal, it switches to active mode.

In active mode, Jinx is mostly interested in tracking memory accesses. This requires trapping page faults, CR3 accesses, and attempts at I/O. As I mentioned before, thin hypervisors don’t provide their own device drivers, so they let the OS deal with I/O using the OS’s own drivers. Jinx doesn’t even do that. It intercepts all attempts at communication and simply stalls the corresponding thread until the end of its short testing period.

Since Jinx does stochastic sampling, it frequently switches between passive and active modes. During normal operation — sampling a program for concurrency bugs — the entry to active mode is triggered by the timer, which sends a non-maskable interrupt (NMI) to the OS. This is why our hypervisor traps NMIs when it’s in passive mode.

Managing Memory

Since a thin hypervisor doesn't have to multiplex operating systems, it can use a much simplified scheme for memory management. For instance, it may make host physical addresses the same as guest physical addresses. Or, like the Corensic hypervisor, it may use a copy-on-write scheme. The pages that are only read use identity mapping from guest physical to host physical. But as soon as a page is written to, the Jinx hypervisor (inside the page-fault handler) makes a copy of it and maps its guest physical address to the host physical address of the copy.
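
Here is a toy, user-space simulation of that copy-on-write idea, just to make the bookkeeping concrete. The arrays and helpers below are stand-ins; the real hypervisor manipulates nested page tables, not C arrays.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE  4096
#define NUM_PAGES  8          /* pretend the guest has 8 physical pages            */
#define HOST_PAGES 16         /* host pool: the originals plus room for copies     */

static uint8_t host_mem[HOST_PAGES][PAGE_SIZE];  /* "host physical" memory         */
static int gpa_to_hpa[NUM_PAGES];                /* guest-to-host mapping, per page */
static int next_free = NUM_PAGES;                /* first unused host page          */

static void init_identity_map(void)
{
    for (int i = 0; i < NUM_PAGES; ++i)
        gpa_to_hpa[i] = i;                       /* read-only identity mapping      */
}

/* What the page-fault handler would do on the first write to a page. */
static void write_fault(int guest_page)
{
    int copy = next_free++;                      /* allocate a fresh host page      */
    memcpy(host_mem[copy], host_mem[gpa_to_hpa[guest_page]], PAGE_SIZE);
    gpa_to_hpa[guest_page] = copy;               /* from now on writes hit the copy */
}

int main(void)
{
    init_identity_map();
    host_mem[3][0] = 42;                         /* original contents of page 3     */
    write_fault(3);                              /* guest writes to page 3          */
    host_mem[gpa_to_hpa[3]][0] = 7;              /* the write lands in the copy     */
    printf("original page holds %d, the guest now sees %d\n",
           host_mem[3][0], host_mem[gpa_to_hpa[3]][0]);
    return 0;
}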

This is how Jinx is able to virtualize the execution of the program during active periods. As long as there is no communication with external devices, Jinx simulations don’t modify any of the original memory pages. They can be restarted and deterministically re-run using the logs of memory accesses that are kept by Jinx. This is the technology that enables Jinx’s algorithms to detect data races (for more information see Pete Godman’s post).

Devices

During a virtual run, as soon as the debugged program tries to access an external device, Jinx has to do something. Otherwise the program’s virtual state would become observable and — if you like analogies with Quantum Mechanics — its wave function would collapse. To prevent that, Jinx blocks the thread that is trying to do I/O until the end of the simulation. When the program is finally run in the real world (as opposed to virtual execution), it will perform the stalled I/O.

In order to implement this behavior, a thin hypervisor must intercept all I/O attempts. Those involve the in and out instructions, interrupts, and the hardest one — memory mapped I/O.

Conclusion

I like to think of a hypervisor as an extension of hardware. It lets you observe the state and modify the behavior of CPUs, memory, and devices. This can be done through hardware, for example using ICE (In-Circuit Emulator), but the software approach is more flexible.

A hypervisor offers a unique ability to look at the executing program and the OS from the hardware perspective. You can not only observe their interactions but also modify the behavior of both software and hardware on the fly. You can even build an interpreter of the instruction stream directed at the processors. Or, like Jinx, you can virtualize and replay the executions of small intervals of a program.

Acknowledgments

I’d like to thank the members of the Corensic team for explaining to me the workings of the Jinx hypervisor and for feedback on this blog.

Bibliography

  1. Joanna Rutkowska, Subverting Vista Kernel for Fun and Profit
  2. Heradon Douglas, Thin Hypervisor-Based Security Architectures for Embedded Platforms
  3. Takahiro Shinagawa et al., BitVisor: A Thin Hypervisor for Enforcing I/O Device Security
  4. David E. Lowell, Yasushi Saito, and Eileen J. Samberg, Devirtualizable Virtual Machines Enabling General, Single-Node, Online Maintenance

In my previous blog I explained how a virtual machine (VM) manages memory behind the back of the operating system. Let’s face it, it’s all based on a lie perpetrated by the hypervisor. Every time the operating system is on the verge of discovering the lie, it is knocked unconscious by the hypervisor who then fixes up the OS’s version of reality. There is a scene in the movie Barnyard that graphically illustrates this principle: The farmer is about to discover that farm animals are intelligent when he’s kicked in the head by the mule.

To implement that mechanism, some help from the hardware is needed. The idea is that the execution of certain instructions or access to certain resources will cause a hardware trap that can be processed by the hypervisor. It’s the same principle that is used by the OS to create virtual reality for user processes.

Rings of Protection

The x86 chip is full of vestigial organs, relics of some long-forgotten trends in operating system design. One of them is the segmented memory architecture, another is the four rings of protection. Three of the rings, 1, 2 and 3, correspond to user mode, while the OS kernel lives in ring 0. In practice, operating systems use just two rings, usually rings 3 and 0, to separate user processes from the kernel and from each other.

The hardware makes this illusion possible by making some processor instructions available only in ring 0. An attempt to execute them in user mode will result in a fault. Protected instructions include those that modify control registers, in particular the page directory register CR3 and the Interrupt Descriptor Table Register, IDTR. (There are also sensitive instructions that can only be executed with the appropriate I/O privilege level — those deal with I/O and interrupt flags.) The OS, having exclusive access to privileged instructions can set up page tables and interrupt handlers — including the page-fault handler. With those, it can hide its own data structures (including said page tables and the IDT) and its code from user processes.

The OS switches from kernel mode to user mode by executing an instruction such as iret or sysret, which drops the current privilege level to ring 3. But since privileged instructions, including those that access control registers, don't work in user mode, switching back to the kernel can only be done through a fault, an interrupt, or a system call.

Hardware Virtualization

With all these rings of protection you’d think that virtualization of the x86 should be a piece of cake. Just put the hypervisor in ring 0, the OS in ring 1, and user processes in ring 3. Unfortunately this doesn’t work, and for many years the x86 was considered non-virtualizable. I’ll describe later how VMware was able to make this work using some pretty amazing tricks. Fortunately, in around 2005, both Intel, with its VT-x, and AMD, with its AMD-V, introduced hardware support for virtualization. The idea was to further split ring 0 into two layers, host mode and guest mode — the hypervisor running in the host mode and the OS in the guest mode. (The two modes are also called root and non-root.)
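
You can check from user space whether your processor advertises these extensions. Here is a small program using the cpuid.h header shipped with GCC and Clang; note that it only tells you the CPU has the feature, not whether the BIOS has left it enabled.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Intel VT-x: CPUID leaf 1, ECX bit 5 (VMX). */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        printf("VT-x (VMX) supported\n");

    /* AMD-V: CPUID leaf 0x80000001, ECX bit 2 (SVM). */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        printf("AMD-V (SVM) supported\n");

    return 0;
}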

In this model, the hypervisor has complete control over hardware and memory. Before transiting into guest mode to give control to the OS, it prepares a special Virtual Machine Control Block (VMCB), which will set the state of the guest. Besides storing copies of guest system registers, VMCB also includes control bits that specify conditions under which the host will trap (exit) into the hypervisor. For instance, the hypervisor usually wants to intercept: reads or writes to control registers, specific exceptions, I/O operations, etc. It may also instruct the processor to use nested page tables. Once the VMCB is ready, the hypervisor executes vmrun and lets the OS take control. The OS runs in ring 0 (guest mode).

When any of the exit conditions that were specified in the VMCB are met, the processor exits into the hypervisor (host mode), storing the current state of the processor and the reason for the exit back into the VMCB.

The beauty of this system is that the hypervisor’s intervention is (almost) totally transparent to the OS. In fact a thin hypervisor was used by Joanna Rutkowska in a successful attack on Windows Vista that stirred a lot of controversy at the 2006 Black Hat conference in Las Vegas. Borrowing from The Matrix, Rutkowska called her rootkit Blue Pill. I’ll talk more about thin hypervisors in my next post.

Software Virtualization

Before there was hardware support for virtualization in the x86 architecture, VMware created a tour de force software implementation of a VM. The idea was to use the protection ring system of the x86 and run the operating system in ring 1 (or sometimes 3) instead of ring 0. Since ring 1 doesn't have kernel privileges, most of the protected things the OS tries to do would cause faults that could be vectored into the hypervisor.

Unfortunately there are some kernel instructions that, instead of faulting in user mode, fail quietly. The notorious example is the popf instruction that loads processor flags from the stack. The x86 flag register is a kitchen sink of bits, some of them used in arithmetic (sign flag, carry flag, etc.), others controlling the system (like I/O Privilege Level, IOPL). Only kernel code should be able to modify IOPL. However, when popf is called from user mode, it doesn’t trap — the IOPL bits are quietly ignored. Obviously this wreaks havoc on the operation of the OS when it’s run in user mode.
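
You can watch this silent failure from an ordinary user-mode program. The following snippet (GCC-style inline assembly, x86-64) tries to raise IOPL to 3 with popf and then reads the flags back; no fault occurs and the IOPL bits stay at zero.

#include <stdint.h>
#include <stdio.h>

#define IOPL_MASK 0x3000ULL    /* bits 12-13 of RFLAGS */

static uint64_t read_flags(void)
{
    uint64_t f;
    __asm__ volatile("pushfq\n\tpopq %0" : "=r"(f));
    return f;
}

int main(void)
{
    uint64_t f = read_flags();
    printf("IOPL before: %llu\n", (unsigned long long)((f & IOPL_MASK) >> 12));

    /* Try to smuggle IOPL = 3 into the flags register from user mode. */
    f |= IOPL_MASK;
    __asm__ volatile("pushq %0\n\tpopfq" : : "r"(f) : "cc");

    /* No fault, but the IOPL bits were silently ignored. */
    f = read_flags();
    printf("IOPL after:  %llu\n", (unsigned long long)((f & IOPL_MASK) >> 12));
    return 0;
}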

The solution is to modify the OS code, replacing all occurrences of popf with hypervisor calls. That's what VMware did, sort of. Since they didn't have the source code to all operating systems, they had to do the translation in binary and on the fly. Instead of executing the OS binaries directly, they redirected the stream of instructions to their binary translator, which methodically scanned it for the likes of popf and produced a new stream of instructions with appropriate traps, which was then sent to the processor. Of course, once you get on the path of binary translation, you have to worry about things like the targets of all jumps and branches having to be adjusted on the fly. You have to divide code into basic blocks and then patch all the jumps and so on.

The amazing thing is that all this was done with very small slowdown. In fact when VMware started replacing this system with the newly available VT-x and AMD-V, the binary translation initially performed better than hardware virtualization (this has changed since).

Future Directions

I’ve been describing machine virtualization as a natural extension of process virtualization. But why stop at just two levels? Is it possible to virtualize a virtual machine? Not only is it possible, but in some cases it’s highly desirable. Take the case of Windows 7 and its emulation of Windows XP that uses Virtual PC. It should be possible to run this combo inside a virtual box alongside, say, Linux.

There is also a trend to use specialized thin hypervisors for debugging (e.g., Corensic’s Jinx) or security (McAfee’s DeepSAFE). Such programs should be able to work together. Although there are techniques for implementing nested virtualization on the x86, there is still no agreement on the protocol that would allow different hypervisors to cooperate.

In the next installment I’ll talk about thin hypervisors.

Bibliography

  1. Gerald J. Popek, Robert P. Goldberg, Formal Requirements for Virtualizable Third Generation Architectures. A set of formal requirements that make virtualization possible.
  2. James Smith, Ravi Nair, The Architecture of Virtual Machines
  3. Ole Agesen, Alex Garthwaite, Jeffrey Sheldon, Pratap Subrahmanyam, The Evolution of an x86 Virtual Machine Monitor
  4. Keith Adams, Ole Agesen, A Comparison of Software and Hardware Techniques for x86 Virtualization
  5. Paul Barham et al., Xen and the Art of Virtualization.
  6. Muli Ben-Yehuda et al., The Turtles Project: Design and Implementation of Nested Virtualization

Just like the OS creates virtual reality for user processes, the hypervisor creates virtual reality for operating systems. In my last blog I talked about virtual memory. I’ll continue by describing how the hypervisor further virtualizes the memory seen by the operating system.

The operating system is responsible for allocating physical pages and mapping them to virtual addresses. When the operating system boots, it asks the hardware how much physical memory there is. If your machine has 4GB of RAM, the OS will know that it can allocate a million or so 4kB pages starting at offset 0. If you want to run more than one OS on the same hardware, the hypervisor will have to provide each of the guest OSs with the illusion of a more or less contiguous block of physical memory starting at address 0, because that’s what all OSs expect.

The simplest way to do that is to introduce one more level of virtualization — one more level of mapping between what the guest OS considers a physical address (it’s called guest physical address) and the actual physical address (which is called host physical address). In other words, a virtual address is first translated to the guest physical address, and then to the host physical address — the latter translation controlled by the hypervisor.

On the left, two processes are running in the operating system OS 1, which allocates guest physical addresses from its own pool. On the right, a second operating system, OS 2, allocates addresses from another pool. The hypervisor maps guest physical addresses from both pools onto one pool of host physical addresses -- the actual DRAM.
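
To make the two translation steps concrete, here is a toy model that composes them with flat arrays of page numbers. Real page tables are multi-level trees, and the numbers here are arbitrary.

#include <stdio.h>

static int guest_page_table[8]  = { 5, 2, 7, 1, 0, 3, 6, 4 };  /* VPN  -> guest PFN (OS-controlled)         */
static int nested_page_table[8] = { 3, 6, 0, 7, 2, 5, 1, 4 };  /* gPFN -> host PFN  (hypervisor-controlled) */

int main(void)
{
    int vpn  = 2;                          /* some virtual page               */
    int gpfn = guest_page_table[vpn];      /* first translation step          */
    int hpfn = nested_page_table[gpfn];    /* second, hypervisor-managed step */
    printf("virtual page %d -> guest physical page %d -> host physical page %d\n",
           vpn, gpfn, hpfn);
    return 0;
}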

There are several ways of implementing this scheme.

Nested Page Tables

Address translation on the x86 is implemented in hardware, with page tables stored in regular memory that is directly accessible to the OS kernel. It’s not easy to hook an additional level of indirection into this machinery. In recent years Intel and AMD started producing chips that provide hardware support for the second layer of virtualization. AMD calls its technology nested page tables (NPT), and Intel calls theirs extended page tables (EPT). Unfortunately those two differ in some important details — e.g., EPT doesn’t support accessed/dirty bits.

The usual page table walk, which happens when the mapping is not present in the TLB, goes through the guest page tables and continues into the host page tables, until the physical address is found. This address then goes into the TLB to speed up future lookups. This happens in hardware without any intervention from the hypervisor.

The hypervisor has to intervene only when a page fault happens. It’s a two stage process: First the operating system processes the fault. It allocates guest physical memory from its (fake) pool and stores it in the guest page table. Remember, the OS knows nothing about the second level of indirection — it’s just your regular Linux or Windows. When the user instruction that caused the fault is restarted, address translation progresses half way, and faults again when it can’t find the guest-to-host mapping. This time the fault is vectored into the hypervisor, which allocates actual physical memory and patches the rest of the page tables.

Older generations of chips don't support NPT, so a different set of software tricks is used.

Shadow Page Tables

In the absence of nested page table support, there is another cheat that’s used by virtual machines. The idea is to create a second copy of page tables called shadow page tables that map virtual addresses directly to host physical addresses (actual DRAM), and let the processor use them for address translation (so the CR3 register points to those, rather than to guest, or primary page tables).

Primary and shadow page tables

The structure of shadow page tables that are kept by the hypervisor reflects the structure of primary page tables that are kept by the operating system. There are several approaches to keeping those two structures in sync.

The naive approach would be to let the OS freely modify its page tables, e.g., in response to a page fault. The hypervisor would only intercept the privileged instruction INVLPG that the OS uses to invalidate a TLB entry. At that point the hypervisor would make a corresponding change to its shadow page tables, substituting the guest physical address with the corresponding host physical address (possibly allocating a physical page from its own pool). Unfortunately, this doesn’t work because some operating systems (I won’t be pointing fingers) occasionally forget to invalidate TLB entries.

The fallback approach is for the hypervisor to write-protect all guest page tables. Whenever the OS tries to modify them, a page fault occurs — the so called tracing fault — which is vectored into the hypervisor. That gives the hypervisor the opportunity to immediately update its shadow page tables. In this scheme the two structures always mirror each other (modulo physical addresses) but at the cost of a large number of page faults.

A common optimization is to allow shadow page tables to be lazily updated. This is possible if not all primary page tables are write protected. If an entry is present in the primary page table but is missing from the shadow page table, a page fault will occur when the hardware walks it for the first time. This is a so called hidden page fault and it’s serviced by the hypervisor, which at that point creates an entry in its shadow page tables based on the entry in the primary page tables.

Since the hypervisor must be involved in the processing of some of the page faults, it must install its own page handler. That makes all faults, real and hidden, vector into the hypervisor. A real page fault happens when the entry is missing (or is otherwise protected) in the primary page tables. Such faults are reflected back to the OS — the OS knows what to do with them. Faults that are caused by the OS manipulating page tables, or by the absence of an entry in the shadow page tables are serviced by the hypervisor.

An additional complication is related to the A/D (accessed and dirty) bits stored in page tables. Normally, whenever a page of memory is accessed, the processor sets the A bit in the corresponding PTE (Page Table Entry); and when the page is written, it also sets the D bit. These bits are used by the OS in its page replacement policy. For instance, if a page hasn’t been dirtied, it doesn’t have to be written back to disk when it’s unmapped. In a virtualized system, the processor will of course mark the shadow PTEs, and those bits have to somehow propagate to the primary page tables for the use by the OS. This can be accomplished by always starting host physical pages in the read-only mode. The hypervisor allocates a host physical page but it sets the read-only bit in its shadow PTE, which is used by the processor for address translation. Reading from this page proceeds normally, but as soon as the processor tries to write to it, there is a protection fault. The hypervisor intercepts this fault: It checks the guest page tables to make sure the page was supposed to be writable; if so, it sets the dirty bit in the guest PTE, and removes the write protection from the host PTE. After that, write access to this page may proceed without further faults.
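
Here is a schematic sketch of that dirty-bit trick. The PTE layout and the names are invented for illustration; real shadow-paging code is considerably more involved.

#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT  0x1u
#define PTE_WRITABLE 0x2u
#define PTE_DIRTY    0x40u

typedef struct { uint32_t flags; uint64_t pfn; } pte_t;

/* Called on a protection fault: the guest wrote to a page whose shadow PTE
   was deliberately left read-only. */
void handle_shadow_write_fault(pte_t *guest_pte, pte_t *shadow_pte)
{
    if (guest_pte->flags & PTE_WRITABLE) {
        guest_pte->flags  |= PTE_DIRTY;      /* propagate the D bit to the OS's view */
        shadow_pte->flags |= PTE_WRITABLE;   /* further writes won't fault           */
    } else {
        /* The OS really made the page read-only: reflect the fault back to it. */
    }
}

int main(void)
{
    pte_t guest  = { PTE_PRESENT | PTE_WRITABLE, 42 };   /* what the OS set up       */
    pte_t shadow = { PTE_PRESENT, 42 };                  /* writable bit withheld    */
    handle_shadow_write_fault(&guest, &shadow);
    printf("guest dirty: %d, shadow writable: %d\n",
           !!(guest.flags & PTE_DIRTY), !!(shadow.flags & PTE_WRITABLE));
    return 0;
}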

The hypervisor keeps separate shadow page tables for each process. When a context switch occurs, the OS tries to modify the CR3 register to point to the primary page directory for the new process. CR3 modification is a protected instruction; it traps into the hypervisor and gives it the opportunity to switch the shadow tables and point CR3 at them. The hypervisor might also choose to discard shadow page tables upon context switch, and later refill them on demand using hidden page faults.

Guest/Host Transitions

All these hypervisor tricks rely on the ability to trap certain operations performed by the operating system. The virtualization of virtual memory requires the trapping of page faults, TLB flushes, and CR3 register modifications. I’ll describe various methods to achieve this in my next blog post.

I’d like to thank David Dunn and Pete Godman for sharing their expertise with me.

Bibliography

  1. Johan De Gelas, Hardware Virtualization: the Nuts and Bolts
  2. Intel® Virtualization Technology
  3. AMD-V™ Nested Paging

Recently we've had yet another discussion at work about the definition of data race. You'd think that being in the business of detecting data races, we should know what a data race is. Unfortunately, it's not so simple. Defining a data race is one of the harder problems faced by the designers of memory models (and concurrency bug checkers). Java had it figured out some time ago, and now C++11 has its own definition. But our product, Jinx, is language agnostic: it catches data races at the processor level.

The problem is that there is no one-to-one mapping between data races as defined by a higher-level language and the object code produced by the compiler. Consider a lock-free algorithm written in C++ using weak atomics (e.g., using memory_order_relaxed operations on atomic variables). When compiled for the x86 processor, the assembly instructions might look identical to those generated without atomics. But from the point of view of C++, the first version is race-free and the second is full of races. In this case the role of atomics is reduced to being nothing more than programmer’s annotation that says, “I know what I’m doing.” (And also, “This code is portable.”)
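
As an illustration, here is the kind of code I have in mind, written with C11 atomics. On the x86 a compiler will typically emit a plain mov for all four functions (compile with -S to check your own compiler), even though only the atomic pair is race-free at the language level.

#include <stdatomic.h>

atomic_int a_flag;   /* accessed with relaxed atomics: no data race in C11   */
int        p_flag;   /* accessed with plain loads and stores: racy if shared */

void set_atomic(void) { atomic_store_explicit(&a_flag, 1, memory_order_relaxed); }
int  get_atomic(void) { return atomic_load_explicit(&a_flag, memory_order_relaxed); }

void set_plain(void)  { p_flag = 1; }
int  get_plain(void)  { return p_flag; }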

On the other hand, neither is there one-to-one correspondence between data races, as defined by a higher-level language, and concurrency bugs. Sometimes looking at how the code is actually executed on a multicore processor may flag concurrency errors that don’t manifest themselves as data races at the source level. And then there are programs that mix high-level languages and inline assembly, for which high-level analysis is impossible.

The purpose of this blog post is to gain some insight into data races at the processor level. I’ll start with the description of the x86 memory model and a definition of data race. I’ll explain the relationship between data races and sequential consistency. Then I’ll show you an example of a Linux spinlock, which contains a deliberate data race. This race can be classified as a non-triangular data race. I’ll explain what a triangular data race is and why it might be important in analyzing program execution.

The x86 Memory Model

In order to understand data races at the lowest level you have to dig into a processor's memory model. For the x86, both Intel and AMD produced documents that provide some guidance, but fall short of defining a formal model. It took a lot of research and second-guessing before academics and Intel engineers agreed on one particular model, called Total Store Order (TSO).

The gist of this model is that, at least conceptually, all writes to memory from a particular core go through this core's write buffer, which is invisible to other cores. The contents of this buffer find their way to global memory in FIFO order at some random intervals or as a result of special instructions. These special instructions are either LOCK'd instructions (the ones with the LOCK prefix) or fences (actually, only MFENCE is relevant here, because the model covers regular write-back memory, as opposed to write-combining memory or non-temporal instructions). Those instructions flush the buffer of the core that is executing them. It's pretty surprising that such a simple model explains all the strange behaviors of the x86 processor with respect to memory loads and stores.

Total Store Order processor memory model

A few more details: Several x86 instructions are modeled as a series of actions (see our earlier blog entry). In particular, a LOCK’d instruction can be split into smaller constituents:

  1. LOCK, which flushes the buffer and locks access to global memory for all cores,
  2. loads, stores, and arithmetic actions specific to a particular instruction,
  3. UNLOCK, which flushes the buffer and unlocks access to global memory.

For instance, the execution of:

LOCK XCHG [x], EAX

may be modeled by:

  1. flushing the local write buffer,
  2. acquiring the lock on the bus,
  3. loading x from memory to a temporary register,
  4. writing EAX to x (through the local write buffer),
  5. copying the value from the temporary to EAX,
  6. flushing the write buffer (thus forcing the new value of x to be written to global memory), and
  7. releasing the bus lock.

Keep in mind that this is an idealized model, not necessarily corresponding to actual architecture or microcode.

x86 Data Races

Let's start with the simplest definition of a data race and build things up from there. Imagine that you are single-stepping through a program that has multiple threads running on multiple cores. At each step you pick one core and execute the next instruction in its thread. For a complex (but not LOCK'd) instruction, like INC[x], you might execute just one of its actions, e.g., the load of [x], at a time. This kind of step-by-step execution is called sequentially consistent, as long as all stores go directly to global memory.

A data race occurs if you can pick two actions accessing the same memory location on two separate cores to be executed next to each other, and at least one of them is a store. Notice that if these two actions can be made adjacent in a global execution order, they can also be executed in the reverse order. Now, if both actions were just memory loads, the order wouldn’t matter. This is the reflection of a more general principle that you can never have data races on immutable data. But if one of the accesses is a store, the order does matter.

This definition of data race is the foundation of dynamic race detectors, including Jinx. Jinx virtually (that is, using a virtual machine technology) re-runs parts of your program in a sequentially consistent manner, rearranging the threads in order to, among other things, bring shared memory accesses next to each other. If the program is properly synchronized, this should be impossible. For instance, one of the threads will be blocked waiting for a lock (this time I’m talking about higher-level locks, or critical sections) while the other is accessing a shared variable under the same lock.

Of course, the next question is about the locks themselves: They are implemented using the same instructions as the rest of the code, and use shared variables, so how come they don’t generate their own data races? Some race detectors recognize a few predefined patterns of instructions that correspond to locks in a particular language, library, or operating system. Again, Jinx tries to be agnostic.

Maybe unsurprisingly, higher level locks are implemented using LOCK’d instructions. So if we narrow the definition of data races to exclude conflicts between locked instructions, we should be fine. But are we allowed to play with definitions? How do we decide what is and what isn’t a data race? Are there some general principles to give us guidance? There is one I mentioned before called…

Sequential Consistency

Ideally, we would like to define data races so that they are always concurrency bugs. This isn’t easy, because we don’t have a definition of a bug. In a higher-level language, like C++, it’s enough to say that a data race leads to undefined behavior. But at a processor level, everything is defined. A processor won’t suddenly stop and say, “I don’t know what to do with this program.”

However, higher level languages give us a hint in terms of the DRF guarantee. DRF stands for Data-Race Freedom, and the guarantee is that if your program is free of data races, it is sequentially consistent (SC). Mind you, an SC program might still be buggy, and even have concurrency bugs, but at least we know how to reason about such programs. We run them in our minds, one action at a time, switching between threads at will, assuming that all writes go directly to main memory and are immediately visible to other threads. We know that this is not what really happens, but if a program is DRF, this is how it behaves.

Not all languages offer the DRF guarantee; C++ supports weak atomics that break SC without introducing data races. Nevertheless, DRF is a good goal.

What does it all mean for the x86? The definition of SC stands at the assembly level, although we have to allow for splitting complex instructions. For instance, INC[x] could be split into three actions: a load, an increment, and a store. If this instruction doesn’t have a LOCK prefix, another thread is free to execute its own actions in between the actions of the current instruction.

A TSO processor, like the x86, would be almost SC, if it weren’t for those pesky write buffers. Still, many memory-access conflicts on an x86 don’t break sequential consistency. For instance, because of the FIFO nature of the write buffers, two conflicting stores don’t break SC, and therefore are not considered a data race. Think of it this way: Inverting two adjacent stores from different threads doesn’t change anything because both stores go to separate write buffers. The external visibility of those stores depends on the order in which the buffers are discharged into global memory, and they are discharged in a sequentially consistent way.

So the only races on the x86 are those between loads and stores, except when both are LOCK'd (in which case they flush their buffers). That's a good starting point, but a practical question remains: If we report all of those events as data races, how many of them will be considered false positives in real life? Unfortunately, the answer is: too many. It turns out that an efficient implementation of higher-order locks often contains deliberate races. It's true that the acquisition of a lock usually starts with a LOCK'd instruction (often a CMPXCHG, also known as CAS), but the corresponding release might be implemented using an unsynchronized store. This is why the Jinx definition of a data race excludes conflicts in which even one of the instructions is LOCK'd.
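
The pattern looks roughly like this when written with GCC-style builtins. This is a simplified sketch, not the implementation of any particular library lock; a production lock also needs, at the very least, a compiler barrier at the release so that accesses inside the critical section don't leak past it.

static volatile int lock_word;              /* 0 = free, 1 = taken */

static void lock_acquire(void)
{
    /* Compiles to a LOCK CMPXCHG loop on the x86. */
    while (__sync_val_compare_and_swap(&lock_word, 0, 1) != 0)
        ;                                   /* spin */
}

static void lock_release(void)
{
    lock_word = 0;                          /* plain, unsynchronized store: the kind
                                               of "race" a naive detector would flag */
}

int main(void)
{
    lock_acquire();
    /* critical section */
    lock_release();
    return 0;
}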

Do we miss some legitimate data races because of that? Perhaps, but we haven’t seen such examples in practice yet. Interestingly, even this narrowed definition of data races flags some legitimate lock-free algorithms, in particular the Linux ultra-fast implementation of a spinlock.

Spinlock

Here’s an actual report produced by Jinx on Linux (Jinx can be used to debug the kernel):

Event: Data Race (3 times)
* Stack A
#0 0xffffffff81036d44 in __ticket_spin_lock at spinlock.h:65
#1 0xffffffff8159e7ee in arch_spin_lock at paravirt.h:744
#2 0xffffffff810909cd in futex_wake at futex.c:891
#3 0xffffffff81093288 in do_futex at futex.c:2602
#4 0xffffffff810933fb in sys_futex at futex.c:2636
* Stack B
#0 0xffffffff81036d74 in __ticket_spin_unlock at spinlock.h:101
#1 0xffffffff81090a58 in arch_spin_unlock at paravirt.h:760
#2 0xffffffff81093288 in do_futex at futex.c:2602
#3 0xffffffff810933fb in sys_futex at futex.c:2636

Two threads, A and B, ran into a race in spinlock.h. One access occurred at line 65 and the other at line 101 of that file. The two stack traces show that the caller, in both cases, was a futex (a "fast userspace mutex"), the Linux implementation of a thin lock.

Let’s have a look at spinlock.h. Line 65 is inside the function __ticket_spin_lock, which contains a bunch of inline assembly, which I’m not going to analyze here (although I will analyze similar code shortly):

 61 static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 62 {
 63     short inc = 0x0100;
 64
 65     asm volatile (
 66         LOCK_PREFIX "xaddw %w0, %1\n"
 67         "1:\t"
 68         "cmpb %h0, %b0\n\t"
 69         "je 2f\n\t"
 70         "rep ; nop\n\t"
 71         "movb %1, %b0\n\t"
 72         /* don't need lfence here, because loads are in-order */
 73         "jmp 1b\n"
 74         "2:"
 75         : "+Q" (inc), "+m" (lock->slock)
 76         :
 77         : "memory", "cc");
 78 }

The thing to notice is that this is the acquisition of a (ticketed) spinlock, the first instruction of which is indeed LOCK’d. Since Jinx doesn’t flag races with LOCK’d instructions, it must be the next instruction that accesses memory, movb, which is involved in the race. It reads a byte-sized value from memory without using a LOCK.

The other racing line, 101, is inside __ticket_spin_unlock:

 99 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
100 {
101     asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
102          : "+m" (lock->slock)
103          :
104          : "memory", "cc");
105 }

This is the release part of the ticketed spinlock, and it contains an increment instruction that accesses memory (the same lock->slock variable as in the previous function). The macro, UNLOCK_LOCK_PREFIX, expands to nothing on processors other than the Pentium Pro (which had a memory-ordering erratum). So we do have an unprotected store (part of incb) that races with an unprotected load (movb in the previous function). And yet, the code is correct. But to understand that, we need to learn about a more sophisticated notion of…

Triangular Data Races

Let’s go back to the x86 TSO memory model. The cores write into their private FIFO buffers. Data moves from these buffers into main memory. At any point in time only one buffer may discharge a previous write entry into memory, so those writes end up interleaved with each other. What’s important is that all processors see those writes appearing in main memory in the same order — the Total Store Order. (To be precise, the processor doing the write will see it immediately due to intra-processor forwarding, but that doesn’t matter in our case.)

Now imagine that you have an actual execution on the x86 and you want to show that it’s equivalent to an SC execution. You could do it, for instance, by rearranging instructions so that stores are moved up to the point where they were originally discharged into global memory. In order to preserve program order on individual processors, you would also have to move up the loads (and other instructions) that appear between those stores. But if you move a load forward too much, it might see a different store than it did in the original execution, and the new SC execution will not be equivalent to the original one.

A typical situation is when stores and loads cross each other, as in this program (x and y originally zero):

P1                  P2
mov [x], 1          mov [y], 1
mov eax, [y]        mov eax, [x]

If stores from both processors still sit in their respective buffers while the loads are performed, both eax registers will end up with zeros. If we try to move the stores forward to the point where the buffered writes were flushed, we’ll also have to push the loads, and they will end up reading ones instead of zeros. This is clearly a non-SC execution and the races it contains are definitely of interest to us.

Rearranging stores to create an equivalent execution with no buffering. Because program order must be preserved, the loads are rearranged too, resulting in different values being read. The SC execution is not equivalent to the original one. (Time flows from left to right.)

It turns out that all violations of SC on the x86 have one thing in common: a triangular data race. Here is the triangular pattern (we have two of those in our example):

P1                  P2                  Comment
mov [x], 1                              Preceding store
                                        No lock or fence in between
mov eax, [y]        mov [y], 1          The race between a load and a store

More generally, a triangular data race occurs between a load action on one processor and a store action on another if the load is preceded by some other store. In the case above we have a race on y, and what makes it triangular is the odd store to x. The key is that this store may still sit in the write buffer when the racing load is made. If, on the other hand, there is any fence or LOCK/UNLOCK between the odd store and the racing load, we are fine; the race is not triangular.

What's amazing is that it can be proven that any x86 program that doesn't contain triangular data races is sequentially consistent. (For a formal definition of a triangular data race and a proof of the theorem, see the paper by Owens.) Armed with this knowledge, let's revisit the spinlock example.

Spinlock and Triangular Races

The Linux ticketed spinlock is an elaboration on a basic spinlock. Here’s the (more readable) Intel assembly code for that simplified spinlock. There is a variable, lck, on which to spin, and its initial value is 1, corresponding to the lock being unclaimed. The address of lck is stored in eax.

Label       Instruction       Comment
acquire:    lock dec [eax]    atomic {
                                  tmp = [lck] - 1
                                  [lck] = tmp
                                  flag = (tmp >= 0)
                                  flush local write buffer }
            jns enter         if (flag) goto enter
spin:       cmp [eax], 0      flag = ([lck] <= 0)
            jle spin          if (flag) goto spin
            jmp acquire       goto acquire
enter:                        the critical section starts here

To acquire the lock, we first atomically decrement lck. If the lock was free, the decremented value should be non-negative (actually, zero). If it isn’t, we start spinning, waiting for lck to become positive, and then try to acquire it again.

To release the lock, we set lck back to 1:

release:    mov [eax], 1      [lck] = 1

Notice that the load part of the instruction cmp [eax], 0 is in a data race with mov [eax], 1. This is the race that is detected by Jinx.

But does this race break sequential consistency? In order to break SC, the race would have to be triangular, which means that the racing load would have to be preceded by some other store that’s not flushed. But the last store before entering the spin loop is part of the LOCK’d instruction, lock dec [eax], which flushes the write buffer. Since there is no triangular race, the spinlock is indeed sequentially consistent.

Conclusion

It’s eye opening to realize how much wiggle room there is in the definition of data race. We have the informal definition: Two simultaneous accesses to the same memory location, one of them being a write. Except that now we have to define “simultaneous,” which is a bit tricky. Simultaneity is characterized by lack of ordering–two events are simultaneous if it’s impossible to tell which came before the other.

Events on different processors may be ordered through synchronizing actions. At the lowest level, these are LOCK’d instructions and fences. However, there is no simple mapping between high-level-language definitions of data races and what is observed at the processor level during execution.

Ultimately, the reason for defining data races is to be able to identify one type of concurrency bugs. But bug detection is in general undecidable, so the best we can do is to define behaviors that are most likely related to programmers’ errors. Most programmers reason about concurrent programs using the sequentially consistent model of execution. This is why the most insidious concurrency bugs are related to the breakdown of sequential consistency. There is a linkage between data races and (the lack of) sequential consistency, which is expressed in terms of the DRF guarantee: A program with no data races behaves in a sequentially consistent way.

We can reverse the reasoning and define data races in a way that would guarantee the DRF property, and this is what is usually done. Seen from that perspective, it’s possible to strengthen the DRF guarantee on TSO processors. The new TRF (Triangular Race Freedom) guarantee states that a program with no triangular data races behaves in a sequentially consistent way and vice versa.

We may reasonably assume that non-triangular data races are not bugs but rather were deliberately coded, either directly by programmers (as was the case with the Linux spinlock), or by a compiler translating a sequentially consistent high-level program. The jury is still out, though, because outside of lock-free programming any race, triangular or not, is still very likely to be a bug.

Acknowledgments

I’d like to thank Anthony Williams, Dmitriy V’yukov, and my co-workers for valuable feedback.

Bibliography

  1. Scott Owens, Susmit Sarkar, Peter Sewell, A better x86 memory model: x86-TSO
  2. Scott Owens, Reasoning about the Implementation of Concurrency Abstractions on x86-TSO

Memory consistency models have an almost mythical aura. They can puzzle the most experienced programmers and lead to bugs that are incredibly hard to understand and fix. If you have written multithreaded code, it is likely that you have stumbled upon memory model woes. Chances are that you have also lost bets with your colleagues because of memory consistency model disputes. In this blog post I will discuss some of the rationale of why memory models were created and give some specific examples of how that affects you.

Memory Consistency Models

First things first, let's define what a memory model is. A memory consistency model defines what values a given read operation may return. The simplest memory model is the Sequential Consistency model, in which each execution behaves as if there were a single global sequence of memory operations, and the operations of a given thread appeared to all threads in the same order as they appear in the program (program order). It is the most natural model for normal humans to think about, because the execution behaves as if it were run on a multitasking uniprocessor.

Consider the following example:

Two processors communicating through shared memory
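
The figure is not reproduced here, but the exchange it depicts is the classic message-passing pattern; here is a rendering in C with two threads (the variable names data and flag follow the discussion below). The volatile qualifier only keeps the compiler from optimizing the spin loop away; at the language level these plain shared accesses are themselves a data race, so treat this as a hardware-level illustration.

#include <pthread.h>
#include <stdio.h>

int data;                    /* the payload                                   */
volatile int flag;           /* volatile only so the compiler keeps the loop; */
                             /* it says nothing about hardware ordering       */

void *p1(void *arg)          /* producer */
{
    (void)arg;
    data = 42;
    flag = 1;                /* signal that data is ready */
    return NULL;
}

void *p2(void *arg)          /* consumer */
{
    (void)arg;
    while (flag == 0)
        ;                    /* spin until signalled */
    printf("read data = %d\n", data);   /* 42? or, on a weakly ordered machine, 0? */
    return NULL;
}

int main(void)               /* build with: cc -pthread example.c */
{
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}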

The question is, what value can the read of data in P2 return? The most obvious answer here is 42. But what would you say if P2 observed the writes to data and flag in the opposite order? P2 could actually read data as "0", which is surprising and not allowed by the sequential consistency memory model. And yet…

The main problem with sequential consistency is that it prevents systems from reordering memory operations to hide long latency operations and improve performance. For example, when a cache miss is being serviced, the processor could execute another memory access that comes after it in program order. That access may hit the cache and therefore complete earlier than the delayed access. Abandoning sequential consistency at the processor level results in dramatically improved performance.

But processors are not the only source of memory operation reordering. Many compiler optimizations effectively reorder code, e.g., loop-invariant code motion, common sub-expression elimination, etc. Furthermore, memory models of languages and memory models of the hardware they run on need not be the same. This is why compilers and synchronization libraries need to insert fences in the generated code. They have to map the language memory model to the hardware model. For example, Java and C++11 (the upcoming C++ standard) support memory models that guarantee sequential consistency for programs free of data races (although C++ also offers ways of relaxing sequential consistency without introducing races–the so called weak atomics).

Due to the difficulty of improving performance under sequential consistency, a variety of "relaxed" memory models were conceived. For example, in the Weak Ordering memory model, there is no guarantee that a processor will observe another processor's memory operations in program order. This is where a "memory fence" (a.k.a. "memory barrier") comes into play. When a fence instruction is executed, it guarantees that all memory operations prior to it in program order are completed (and visible to other processors) before any operation after the fence in program order is allowed to proceed. You would be bored and stop reading if I described the multitude of consistency models in this post. However, I do encourage you to read more about memory models in this very nice tutorial by Sarita Adve and Kourosh Gharachorloo. Also, Paul McKenney's paper has a nice table summarizing the ordering relaxations in modern microprocessors.
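
Applied to the data/flag exchange above, a weakly ordered machine needs a fence on both sides: one between the two writes in P1 and one between the read of flag and the read of data in P2. A standalone sketch using GCC/Clang's full-barrier builtin __sync_synchronize (which also acts as a compiler barrier) might look like this:

#include <stdio.h>

int data;                    /* same roles as in the earlier sketch */
volatile int flag;

void p1_fenced(void)
{
    data = 42;
    __sync_synchronize();    /* make the write to data visible before flag */
    flag = 1;
}

void p2_fenced(void)
{
    while (flag == 0)
        ;
    __sync_synchronize();    /* don't let the read of data move up */
    printf("read data = %d\n", data);   /* now reads 42 */
}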

The x86 Memory Model

Now let’s talk about some of highlights of the x86 memory model. A big disclaimer first. This can change and probably does change between models, so it is always a good idea to check the manuals before endeavoring in sensitive code (8-8 Vol. 3 in this manual for Intel and Section 7.2 in this manual for AMD).

In a nutshell, recent implementations of the x86 ISA (P6 and on) follow, roughly, what is normally termed total store order (TSO), which is stronger than "processor consistency" (the model people often assume the x86 follows). Its key ordering properties are:

  1. reads are not reordered with respect to reads;
  2. writes are not reordered with respect to reads that come earlier in the program order;
  3. writes are not reordered with respect to most writes (excluding, e.g., multiple writes implicit in string operations like REP MOVSB);
  4. reads may be reordered with respect to writes that come earlier in program order as long as those writes are to a different memory location;
  5. reads are not reordered with respect to I/O instructions, locked instructions and other serializing instructions.

There are no guarantees whatsoever of ordering between writes of different processors; the outcome of concurrent writes to the same memory location is non-deterministic. Increment instructions have no atomicity guarantees; moreover, even some write operations that update multiple bytes are not guaranteed to be atomic (see Andy's blog post). For example, if a write operation to multiple bytes happens to cross a cache line boundary, the operation is not guaranteed to be atomic.

Here is an example of how the x86 memory model can get you in trouble:

The outcome t1 == 0 and t2 == 0 is possible on the x86
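
The figure is not shown here, but from the discussion that follows it contains the classic store-buffering test, which in C-like form reads (A and B start at zero, t1 and t2 are per-processor temporaries):

int A, B;       /* shared, initially zero         */
int t1, t2;     /* per-processor result variables */

void p1(void) { A = 1; t1 = B; }   /* runs on P1 */
void p2(void) { B = 1; t2 = A; }   /* runs on P2 */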

An execution whose final state is t1 == 0 and t2 == 0 is allowed. Such an outcome is unintuitive, non-sequentially consistent, because there is no serialized execution that leads to this state. In any serialized execution, there will be an assignment in one processor (A = 1 or B = 1) prior to a read in the other processor (t1 = B or t2 = A).

Another way to look at the problem is to try to build a happens-before graph of the execution. In this representation, each node is an executed instruction. A directed edge from instruction P to instruction Q is drawn if Q has observed the effects of P, and P has not observed the effects of Q, so P “happens-before” Q. (Note that, in the C++11 model, “happens-before” has a slightly different meaning than what is used in this post.)

Here is the happens-before graph for the example above when the outcome is t1 == 0 and t2 == 0:

An attempt to draw happens-before arrows for non-sequentially-consistent execution

Edge (1) exists because the read t1 = B in P1 did not observe the write B = 1 in P2. The same applies to edge (2). Edges (3) and (4) are there because of program order. Since there is a cycle in the happens-before graph, there is no serialized order that would satisfy the happens-before relationship. Therefore, the execution is not sequentially consistent. What happened in this example is that the read operation t1 = B in P1 proceeded before the write operation in A = 1 had a chance to complete and become visible to P2.

Here is another example of how the x86 memory model leads to surprising results:

The outcome t2 == 0 and t4 == 0 is possible on the x86
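
Again reconstructing the figure from the discussion below (the exact naming of P2's first temporary is inferred), the test is, with A and B starting at zero:

int A, B;       /* shared, initially zero */
int t1, t2;     /* temporaries on P1      */
int t3, t4;     /* temporaries on P2      */

void p1(void) { A = 1; t1 = A; t2 = B; }   /* runs on P1 */
void p2(void) { B = 1; t3 = B; t4 = A; }   /* runs on P2 */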

The snippet of execution above might lead to an unexpected state where t2 == t4 == 0. Let's look at this from the perspective of P1. This may happen because the processor can quickly forward the value from the pending write to A (A = 1) to the read of A (t1 = A) on the same processor. Meanwhile, the other processor, P2, can't access this pending write (because it's still waiting in P1's private write buffer) and reads the old value (t4 = A). Note that this outcome cannot be explained by a simple reordering of t2 = B and t1 = A! Intuitive, huh?

One final example for you to noodle over. Consider a boiled-down version of Dekker's mutual exclusion algorithm for two threads:

Unintuitive outcome of a boiled-down Dekker's algorithm: both threads enter the critical section
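
The figure reduces to the following (flag1 and flag2 start at zero):

int flag1, flag2;    /* both initially zero */

void p1(void)
{
    flag1 = 1;                 /* announce intent to enter                    */
    if (flag2 == 0) {
        /* critical section: P1 concluded P2 is not trying to enter           */
    }
}

void p2(void)
{
    flag2 = 1;
    if (flag1 == 0) {
        /* critical section */
    }
}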

The gist of the algorithm is to use two flag variables, one for each processor, flag1 and flag2. P1 sets flag1 when it is attempting to enter the critical section; it then checks whether flag2 is set. If flag2 is not set, it means P2 has not attempted to enter the critical section, so P1 can safely enter it. Because the x86 memory model allows reordering of loads with respect to earlier stores, the read of flag2 can proceed before the setting of flag1 is completed, which can lead to both processors entering their critical sections, since P2 might have just set flag2!

That is it! I hope this helped you get a better grasp of what a memory consistency model is and understand a few of the key aspects of the x86 model. And, if you come across something that looks like a memory consistency bug, try building that happens-before graph to find cycles and remember to look at the manual 🙂 . Have fun!