Jinx


Writing a hypervisor for a virtual machine requires a lot of effort and is comparable in complexity to writing an operating system.

A standalone OS must support a lot of functionality and a large number of hardware configurations. However, it’s not really that hard to write a stripped-down kernel. Imagine that you only care about the memory subsystem and a subset of interrupts, but don’t have to run more than one process at a time, and don’t have to deal with I/O devices. I bet you could write such a system in a few weeks (and spend a few years debugging it ;-).

Likewise, writing a full-blown virtual machine hypervisor that can multiplex several OSs and interact with various hardware configurations requires a lot of effort. In particular, a hypervisor must be able to virtualize a large variety of devices. But a stripped-down hypervisor that deals with one OS at a time and doesn’t monitor any devices is within reach of a small, dedicated team of systems programmers (such as our team at Corensic). And there are plenty of applications for such a thin hypervisor, applications that have nothing to do with cramming multiple operating systems into one box.

A thin hypervisor can transparently hook into many low level activities, whether to monitor or to emulate them. Because a thin hypervisor is much smaller than the OS, it’s also easier to verify its code and incorporate it in the trusted computing base (TCB). McAfee’s DeepSAFE technology relies on this property to protect the system from security threats. Corensic Jinx uses a thin hypervisor to monitor patterns of memory access to detect concurrency bugs. I’m sure there will be many more uses for thin hypervisors once the word gets out that they are feasible to implement.

I will concentrate on the architecture of the Jinx thin hypervisor, since I know it better than others, and because many implementation ideas are common to most thin hypervisors. In my previous blog I described the basics of virtualization, so please read it if you’re not familiar with some of the terms I use here.

A full-blown “thick” hypervisor is really an operating system and, like any OS, is loaded during bootstrap. Being the first system to boot has its advantages: a thick hypervisor can start lying to the guest OS from the very beginning. When it loads the guest OS, it can cheat about the type of processor, the number of cores, the size of physical memory, the presence and capabilities of I/O devices, etc. It’s quite common for a thick hypervisor to present the OS with a virtual machine of very limited capabilities. This way it can provide virtual drivers for a standard subset of devices rather than support all the latest and greatest.

In contrast, a thin hypervisor is inserted after the operating system has fully booted. So the OS knows very well what hardware it’s running on, and the opportunities for cheating are much more limited.

Loading the Thin Hypervisor

How does the thin hypervisor sneak under the operating system? The OS always starts in ring 0 but, when virtualization is enabled in the BIOS, the operating system starts in the host mode of ring 0 (as opposed to the guest mode — see my previous blog). The OS has the highest privilege you can get. So the question is not how to shimmy the hypervisor under the OS but rather how to dislodge the OS from host mode.

This can only be done by code that runs inside the kernel with the highest privilege level. And the only way to install application code in the kernel is in the form of a device driver. This is why, when installing a thin hypervisor like Jinx, the user needs to elevate his or her security level. In the case of Windows, the driver needs to be signed by Microsoft.

The driver sets up the VMCB (Virtual Machine Control Block — see my previous blog) and prepares handlers for particular events that are to be trapped by the hypervisor. It sets the instruction pointer in the VMCB to point to a specific offset in the driver code and executes VMRUN. The processor switches to guest mode and jumps to the specified offset. From that point on, the OS, including the driver, runs in guest mode until it traps to the hypervisor and executes one of the handlers in host mode.
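
Here is, roughly, what this dance might look like on a processor with AMD-style virtualization (SVM, which is where the VMCB/VMRUN terminology comes from). This is only a sketch of the idea, not Corensic’s actual code; every helper below is an illustrative placeholder:

using u64 = unsigned long long;

struct Vmcb;                      // 4K control block with a hardware-defined layout

// Illustrative placeholders for the hardware-touching pieces:
void  EnableSvm();                // set EFER.SVME so that VMRUN becomes legal
Vmcb* AllocVmcb();                // page-aligned, physically contiguous memory
void  CaptureLiveState(Vmcb*);    // segments, control registers, flags...
void  SetGuestRip(Vmcb*, u64);    // where the guest resumes after VMRUN
void  SetIntercepts(Vmcb*);       // e.g. CPUID and NMI (more on that below)
void  VmRun(Vmcb*);               // wraps the VMRUN instruction
void  HandleExit(Vmcb*);          // dispatch to one of the prepared handlers

extern "C" void GuestResumePoint(); // the OS continues here, now in guest mode

void DislodgeOs() {
   EnableSvm();
   Vmcb* vmcb = AllocVmcb();
   CaptureLiveState(vmcb);         // the guest state is simply the current state
   SetGuestRip(vmcb, (u64)GuestResumePoint);
   SetIntercepts(vmcb);
   for (;;) {
      VmRun(vmcb);      // the CPU switches to guest mode; control comes back
                        // here, in host mode, at the next intercepted event
      HandleExit(vmcb);
   }
}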

The opposite procedure — the unloading of the hypervisor — is also easy. A thin hypervisor may jump in and out from under the OS at any time.

If inserting a hypervisor is so simple, you might be wondering if the same procedure couldn’t also be used to install an almost undetectable virus on somebody’s machine. Indeed, as I already mentioned in my previous blog, Joanna Rutkowska, a security researcher from Poland, shocked the community by demonstrating such an attack, called Blue Pill, on Windows Vista. Part of the attack was installing an unsigned device driver which, as it turned out, was possible on Vista RC1 (it was done by forcing existing drivers to be paged out and then modifying their code directly in the page file). This security hole has since been closed, but still, the specter of a thin-hypervisor attack led most hardware manufacturers to disable virtualization in the BIOS by default. This is why you don’t see malware using this kind of attack (knock on wood!). But, as a side effect of this defensive behavior, the first-time installation of Jinx might require enabling virtualization in the BIOS. You have to do it during the boot sequence — a minor nuisance.

Communicating with the Hypervisor

The way you talk to the hypervisor, either from user mode or from the driver, is a bit tricky. We decided to trap the CPUID instruction into the hypervisor. It’s a rarely used unprotected instruction, and it’s not performance sensitive (that is, nobody calls it in a tight loop a million times). CPUID returns various pieces of information about the processor, depending on the value stored in EAX. Only a few values are meaningful; the rest are up for grabs. Our hypervisor sets a trap for CPUID and inspects EAX for a special value. For all other values, it forwards CPUID to the physical processor.
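
To make this concrete, here’s a sketch of what the CPUID exit handler could look like. The magic leaf value and all the names are made up; on SVM the guest’s RAX is kept in the VMCB, while the other general-purpose registers have to be saved and restored by the host itself:

const unsigned JinxLeaf = 0x4A4E5800;  // made-up magic value passed in EAX

struct GuestRegs;                      // guest RBX, RCX, RDX..., saved on each exit

u64  GetGuestRax(Vmcb*);               // on SVM, guest RAX lives in the VMCB
void AdvanceGuestRip(Vmcb*, int bytes);
void DispatchHypercall(Vmcb*, GuestRegs*);
void ForwardCpuid(Vmcb*, GuestRegs*);  // run the real CPUID, pass results through

void HandleCpuidExit(Vmcb* vmcb, GuestRegs* regs) {
   if ((unsigned)GetGuestRax(vmcb) == JinxLeaf) {
      // It's for us: treat the remaining registers as a command plus arguments.
      DispatchHypercall(vmcb, regs);
   } else {
      // A genuine query: let the physical processor answer it.
      ForwardCpuid(vmcb, regs);
   }
   AdvanceGuestRip(vmcb, 2);   // step over the two-byte CPUID instruction
}

On the calling side, the signal is nothing more than executing CPUID with EAX preloaded with the magic value, for instance through a compiler intrinsic such as MSVC’s __cpuid.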

The communication back from the hypervisor is easy since the hypervisor has complete control over virtual memory and can map any data into user’s or driver’s address space.

Passive and Active Mode

Jinx uses the hypervisor for stochastic testing of a running program. It lets the program run without interference most of the time and periodically, for short intervals, switches to active mode. This way Jinx doesn’t noticeably slow down the running program, which is important for realistic testing.

The Jinx hypervisor starts in passive mode, in which it does nothing other than service CPUID traps (actually, it is also sensitive to non-maskable interrupts, NMIs, which I’ll discuss later). But when the hypervisor gets the prearranged signal, it switches to active mode.

In active mode, Jinx is mostly interested in tracking memory accesses. This requires trapping page faults, CR3 accesses, and attempts at I/O. As I mentioned before, thin hypervisors don’t provide their own device drivers, so they let the OS deal with I/O using the OS’s own drivers. Jinx doesn’t even do that. It intercepts all attempts at communication and simply stalls the corresponding thread until the end of its short testing period.

Since Jinx does stochastic sampling, it frequently switches between passive and active modes. During normal operation — sampling a program for concurrency bugs — the entry to active mode is triggered by the timer, which sends an NMI to the OS. This is why our hypervisor traps NMIs when it’s in passive mode.
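
In terms of the VMCB, the mode switch could amount to nothing more than reprogramming the intercept bits. Again, this is a sketch with made-up flag names rather than the real hardware encodings:

enum InterceptFlags {               // made-up encodings
   INTERCEPT_CPUID        = 1 << 0,
   INTERCEPT_NMI          = 1 << 1,
   INTERCEPT_CR3_WRITE    = 1 << 2,
   INTERCEPT_IOIO         = 1 << 3,
   INTERCEPT_EXCEPTION_PF = 1 << 4,
};

void SetIntercepts(Vmcb*, unsigned flags);

void SetPassiveMode(Vmcb* vmcb) {
   // Stay out of the way: only the hypercall channel and the timer NMI.
   SetIntercepts(vmcb, INTERCEPT_CPUID | INTERCEPT_NMI);
}

void SetActiveMode(Vmcb* vmcb) {
   // Additionally watch the memory subsystem and all attempts at I/O.
   SetIntercepts(vmcb, INTERCEPT_CPUID | INTERCEPT_NMI
                     | INTERCEPT_CR3_WRITE       // address-space switches
                     | INTERCEPT_IOIO            // in/out instructions
                     | INTERCEPT_EXCEPTION_PF);  // page faults
}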

Managing Memory

Since a thin hypervisor doesn’t have to multiplex operating systems, it can use a much simplified scheme for memory management. For instance, it may make host physical addresses the same as guest physical addresses. Or, like the Corensic hypervisor, use a copy on write scheme. The pages that are only read use identity mapping from guest physical to host physical. But as soon as a page is written to, the Jinx hypervisor (inside the page-fault handler) makes a copy of it and maps its guest physical address to the host physical address of the copy.

This is how Jinx is able to virtualize the execution of the program during active periods. As long as there is no communication with external devices, Jinx simulations don’t modify any of the original memory pages. They can be restarted and deterministically re-run using the logs of memory accesses that are kept by Jinx. This is the technology that enables Jinx’s algorithms to detect data races (for more information see Pete Godman’s post).
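
Here’s the gist of that copy-on-write logic in the page-fault handler. As before, this is a sketch with illustrative names, assuming hardware nested paging:

void OnNestedPageFault(u64 guestPhys, bool isWrite) {
   u64 gfn = guestPhys >> 12;          // guest physical frame number
   if (!isWrite) {
      MapReadOnly(gfn, gfn);           // identity mapping: guest frame == host frame
      return;
   }
   u64 copy = AllocHostFrame();        // a fresh host physical page
   CopyPage(copy, gfn);                // duplicate the original contents
   MapWritable(gfn, copy);             // guest writes now land in the copy
   LogAccess(gfn);                     // kept for deterministic replay
}

Since all writes land in the copies, the original pages stay pristine, and a simulation can be thrown away or replayed at will.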

Devices

During a virtual run, as soon as the debugged program tries to access an external device, Jinx has to do something. Otherwise the program’s virtual state would become observable and — if you like analogies with Quantum Mechanics — its wave function would collapse. To prevent that, Jinx blocks the thread that is trying to do I/O until the end of the simulation. When the program is finally run in the real world (as opposed to virtual execution), it will perform the stalled I/O.

In order to implement this behavior, a thin hypervisor must intercept all I/O attempts. Those involve the in and out instructions, interrupts, and the hardest one — memory-mapped I/O.
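
On SVM, for instance, the first two can be trapped through VMCB settings, while memory-mapped I/O has to be caught through the nested page tables. A sketch, again with illustrative names:

void InterceptAllIo(Vmcb* vmcb) {
   TrapAllPorts(vmcb);     // mark every port in the I/O permission map,
                           // so each in/out instruction exits to host mode
   SetIntercepts(vmcb, INTERCEPT_IOIO | INTERCEPT_INTR);
   UnmapDeviceRanges();    // memory-mapped device pages now raise nested
                           // page faults that land in the hypervisor
}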

Conclusion

I like to think of a hypervisor as an extension of hardware. It lets you observe the state and modify the behavior of CPUs, memory, and devices. This can be done through hardware, for example using an ICE (in-circuit emulator), but the software approach is more flexible.

A hypervisor offers a unique ability to look at the executing program and the OS from the hardware perspective. You can not only observe their interactions but also modify the behavior of both software and hardware on the fly. You can even build an interpreter of the instruction stream directed at the processors. Or, like Jinx, you can virtualize and replay the executions of small intervals of a program.

Acknowledgments

I’d like to thank the members of the Corensic team for explaining to me the workings of the Jinx hypervisor and for feedback on this blog.

Bibliography

  1. Joanna Rutkowska, Subverting Vista Kernel for Fun and Profit
  2. Heradon Douglas, Thin Hypervisor-Based Security Architectures for Embedded Platforms
  3. Takahiro Shinagawa et al., BitVisor: A Thin Hypervisor for Enforcing I/O Device Security
  4. David E. Lowell, Yasushi Saito, and Eileen J. Samberg, Devirtualizable Virtual Machines Enabling General, Single-Node, Online Maintenance

Concurrency bugs are hard to spot and hard to debug. They have a tendency to slip through the cracks of traditional testing. In this blog post I’ll go through a simple example of a buggy program and explain why the bug is so hard to reproduce and how a dynamic debugger like Jinx can accelerate this process by orders of magnitude.

Toy Example

How do you show somebody that their program has a concurrency bug? You come up with an interleaving of threads that demonstrates the bug: “What if thread A is pre-empted between this instruction and the next one, and thread B jumps in and executes that instruction?”

Here’s a trivially incorrect example of a concurrent class:

#include <mutex>

const double PI = 3.14159265358979;

class Circle {
   std::mutex mutex;
   double r; // radius
   double c; // circumference
public:
   Circle(double radius) : r(radius), c(2 * PI * radius) {}
   double GetRadius() { // used by the defensive assertion below
      mutex.lock();
      double radius = r;
      mutex.unlock();
      return radius;
   }
   void SetRadius(double radius) {
      mutex.lock();
      r = radius;
      mutex.unlock();
   }
   void UpdateCircumference() {
      mutex.lock();
      c = 2 * PI * r;
      mutex.unlock();
   }
   double GetArea() {
      mutex.lock();
      double area = 0.5 * c * r; // = PI * r * r when c is up to date
      mutex.unlock();
      return area;
   }
};

The problem arises if thread A is doing this:

circle.SetRadius(r1);
circle.UpdateCircumference();

while thread B is doing this:

area = circle.GetArea();

The question you might ask the unfortunate programmer during a code review (you have those, right?) is: “What if thread A is pre-empted between SetRadius and UpdateCircumference and thread B jumps in and executes GetArea?”. That’s the interleaving that exposes the atomicity violation bug in this code.

Code reviews don’t catch all bugs, so how about some defensive programming? Suppose the programmer adds an assertion, just in case:

area = circle.GetArea();
assert(area == PI * circle.GetRadius() * circle.GetRadius());

Will this help in uncovering the bug? (By the way, did you notice that this assertion has a concurrency bug of its own?)
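
Here’s how such a test might be wired up: a minimal two-thread harness around the Circle class above (my own sketch, not part of the original example). Note that the assertion, as the parenthetical remark above points out, is itself racy: it reads the radius twice without holding the lock across the whole check.

#include <cassert>
#include <thread>

// Assumes the Circle class, with GetRadius, and PI defined above.
int main() {
   Circle circle(1.0);
   std::thread a([&] {
      for (int i = 0; i < 10000000; ++i) {
         circle.SetRadius(i % 10 + 1.0);
         // <- the window of vulnerability is right here
         circle.UpdateCircumference();
      }
   });
   std::thread b([&] {
      for (int i = 0; i < 10000000; ++i) {
         double area = circle.GetArea();
         assert(area == PI * circle.GetRadius() * circle.GetRadius());
      }
   });
   a.join();
   b.join();
}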

Back-of-the-Envelope Calculation

Let’s assume that, during normal testing on a multicore machine, the non-atomic update in thread A executes once a second, and so does the GetArea call in thread B.

We’ll hit the bug if thread B executes its mutex.lock() in GetArea after thread A takes the lock in SetRadius and before it re-takes the lock in UpdateCircumference. That’s the window of vulnerability.

If thread B starts its call within the vulnerability window, it will read the new value of r and the old value of c. Notice that while thread A holds the lock, thread B is spinning, waiting for thread A to release the lock.

Most of the time in this window is spent acquiring and releasing the lock — which on average should take about 1000 processor cycles. On a 1GHz core that’s about 10⁻⁶s.

Since the window is open for roughly 10⁻⁶ of every second, each second of testing hits it with probability of about 10⁻⁶. In our example, we would therefore have to run the test continuously for about 10⁶s in order to hit the bug. That’s about 10 days of testing. Of course your mileage may vary, but you see what the problem is. Concurrency bugs are very hard to reproduce.

Crash Early and Crash Often

A tool that could accelerate the detection of concurrency bugs would save the developers and testers a lot of time and effort. That’s what Corensic’s Jinx does. The basic idea is simple: Run multiple simulations of the execution of code fragments in parallel, carefully picking the candidate interleavings to increase the chances of hitting the vulnerability window.

The enabling technology in this case is virtualization. Jinx is able to periodically virtualize the machine for short periods of time in order to run its simulations.

[Figure: During a simulation round, Jinx runs several exploratory executions in virtual space, picks the most promising one, and runs it in real space. In this case, Jinx found an execution that leads to a crash.]

Here’s what Jinx does when it takes control of the machine (a rough code sketch follows the list):

  1. It stops the running program and takes a snapshot of its state.
  2. It runs about a millisecond’s worth of execution of the program in virtual space and records all memory accesses.
  3. It explores several alternative executions of the same fragment in virtual space. It ranks them and picks (“retires” in our jargon) the one that either crashes the program or is most likely to manifest a concurrency bug.
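
In pseudo-C++, one round might look like this (purely illustrative; every type and function here is made up):

void SimulationRound() {
   Snapshot snap = TakeSnapshot();          // 1. stop the program, checkpoint it
   Trace trace = RunVirtual(snap, OneMs);   // 2. ~1ms of recorded execution,
                                            //    including all memory accesses
   std::vector<Schedule> candidates =
         Reschedule(trace);                 // alternative interleavings
   Schedule best = PickBest(candidates);    // 3. rank and choose
   Restore(snap);
   RunReal(best);                           // "retire" it in real space
}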

In the picture above, Jinx took a snapshot at point A and ran several simulations, one of them resulting in a (virtual) crash at point B. It then retired this particular simulation and ran it in real space, crashing the program at point C.

Of course, the most important part of the algorithm is the choice of executions for simulations. Random rescheduling of the threads is as unlikely to catch the bug as is regular testing. Exhaustive exploration of all possible interleavings would take astronomical time.

Smart Exploration

Imagine that you film the activities of all processors while executing a multithreaded program. Each frame corresponds to a single processor instruction. You’d end up with N tapes from N cores. Your task is to splice those tapes into one continuous movie. You can cut the tapes at arbitrary locations and interleave the fragments. Because of combinatorial explosion, the number of possible movies is overwhelming. That’s how many possible executions there are in a concurrent program. But Jinx has some additional information that cuts down this combinatorial space to a much more reasonable size.

As I mentioned before, during the first exploratory execution of a program fragment Jinx records all memory accesses. Of particular interest to us are accesses to shared memory — we recognize those when a thread touches a memory location that’s already been touched by another thread. Those shared memory accesses are our cue points — that’s where we cut the tape.

The idea is that the fragments between shared-memory accesses have no dependency on what other threads are doing, so any interleaving between them will have no impact on program behavior. What matters is the order in which the shared memory accesses happen, and their re-arrangements may produce distinct observable behaviors.

Since in a typical concurrent program there aren’t that many shared variables, the number of possible rearrangements is quite manageable.
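
To get a feel for the difference, count the splices. Two tapes of lengths n and m, each with its internal order preserved, can be interleaved in C(n+m, n) ways. The tape lengths below are made up for illustration:

#include <cstdint>
#include <cstdio>

// Number of order-preserving interleavings of two tapes: C(n+m, n).
uint64_t Interleavings(uint64_t n, uint64_t m) {
   uint64_t result = 1;
   for (uint64_t i = 1; i <= m; ++i)
      result = result * (n + i) / i;   // exact: the product is divisible by i
   return result;
}

int main() {
   // Cutting at every instruction: hopeless even for two 30-instruction tapes.
   std::printf("%llu\n", (unsigned long long)Interleavings(30, 30)); // ~1.2e17
   // Cutting only at shared accesses, say 6 in one thread and 4 in the other:
   std::printf("%llu\n", (unsigned long long)Interleavings(6, 4));   // 210
}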

Let’s go back to our example and study it in more detail. The simplest implementation of a mutex is based on one shared memory location, let’s call it l, and accesses to it go through some version of the exchange instruction (for instance LOCK CMPXCHG), which we will simply call xchg. From the point of view of Jinx, both lock and unlock look like xchg l. There are two more shared variables, circle.r and circle.c.
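
For concreteness, here is the kind of xchg-based spinlock this description has in mind (a sketch; a real mutex implementation is more elaborate):

#include <atomic>

class SpinLock {
   std::atomic<int> l{0};      // the shared location l
public:
   void lock() {
      while (l.exchange(1))    // compiles down to a locked exchange on x86
         ;                     // spinning: every retry is another xchg l
   }
   void unlock() {
      l.exchange(0);           // from Jinx's viewpoint, also just an xchg l
   }
};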

Suppose that, during the initial exploration, threads A and B executed their respective accesses to Circle. As I explained earlier, it’s very unlikely that thread B would hit the vulnerability window, so the execution proceeded without conflict and Jinx recorded all accesses to shared variables.

[Figure: Initial exploration. Thread B makes its call to GetArea far outside the vulnerability window. Only accesses to shared memory are represented in this diagram.]

Let’s concentrate on the shared variable l used by the mutex. Thread A accessed l four times, so there are four distinct locations to which thread B’s first access to l could be moved (this is done by selectively delaying some threads). Jinx might therefore run four possible simulations involving l. Here’s the first one:

[Figure: Simulation 1. Thread B takes the lock before thread A's first access to l. Notice that thread A has to spin (the xchg instruction is repeated three times), waiting for thread B to release the lock. Thread B correctly calculates the old area of the circle.]

Notice an interesting fact: The simulated execution is not just a simple splicing of two recorded tapes. Unlike in the original execution, when thread A tries to enter the critical section it is blocked by thread B and has to spin. A new choice of interleaving often leads to a different execution.

The second and third simulations result in the bug manifesting itself.

[Figure: Simulation 2. Thread B accesses l between thread A's first and second access to l. This time thread B has to spin until thread A releases the lock; after that, thread A has to spin until thread B releases the lock. This simulation hits the bug: thread B sees the new value of r and the old value of c.]

[Figure: Simulation 3. Thread B accesses l between thread A's second and third access to l. This simulation also hits the bug.]

The fourth simulation misses the vulnerability window.

[Figure: Simulation 4. Thread B accesses l between thread A's third and fourth access to l. Thread B has to spin, but the bug is not hit: the new area of the circle is correctly calculated.]

This is not bad: two of the four simulations expose the bug. Suddenly the likelihood of producing buggy behavior shot up substantially.

Let’s do another back-of-the-envelope calculation. Assume that during each simulation round Jinx samples about 1ms of program execution and works on it for roughly 100ms. In order not to slow the program down too much, Jinx may perform, say, 5 simulation rounds per second.

The probability that both accesses from our example will fall within the 1ms sampling period is 10⁻³. Jinx can perform the 10³ simulation rounds necessary to reproduce the bug in 200 seconds, or slightly more than 3 minutes. Compare this with the 10 days of unassisted testing we got previously.

What Happens Next?

Suppose that, after running for about 3 minutes, Jinx found a simulation that manifested a concurrency bug. The question is: What are the consequences of this bug? If the bug causes an assertion to fire, or results in an access fault within the same 1ms simulation period, Jinx will immediately re-run the same simulation in real time and cause the program to crash, thus proving beyond reasonable doubt that there was indeed a bug in the program.

But even if Jinx cannot crash the program, it will still rate the different simulations according to its own heuristics, and it will run the one that scores the highest. In our example, even if the five simulations (including the original run) are rated the same, Jinx will pick one of them with probability 1/5. Since two of them expose the bug, Jinx should be able to hit one in less than 10 minutes. Although the program won’t crash immediately, it will be running with corrupt data, which will eventually be discovered.

Conclusion

A dynamic concurrency debugger works by rearranging thread interleavings. In general, this method leads to an exponential explosion. Jinx uses two innovative techniques to make this approach feasible: It runs multiple simulations in parallel using virtual machine technology, and it rearranges thread timings based on shared-memory communications. The result is a concurrency bug accelerator that can speed testing up by orders of magnitude while running a program at almost normal speed.