

Concurrency bugs are hard to spot and hard to debug. They have a tendency to slip through the cracks of traditional testing. In this blog post I’ll go through a simple example of a buggy program, explain why the bug is so hard to reproduce, and show how a dynamic debugger like Jinx can accelerate this process by orders of magnitude.

Toy Example

How do you show somebody that their program has a concurrency bug? You come up with an interleaving of threads that demonstrates the bug: “What if thread A is pre-empted between this instruction and the next one, and thread B jumps in and executes that instruction?”

Here’s a trivially incorrect example of a concurrent class:

#include <mutex>

const double PI = 3.14159265358979;

class Circle {
   std::mutex mutex;
   double r; // radius
   double c; // circumference
public:
   Circle(double radius) : r(radius), c(2 * PI * r) {}
   void SetRadius(double radius) {
      mutex.lock();
      r = radius;
      mutex.unlock();
   }
   void UpdateCircumference() {
      mutex.lock();
      c = 2 * PI * r;
      mutex.unlock();
   }
   double GetRadius() {
      mutex.lock();
      double radius = r;
      mutex.unlock();
      return radius;
   }
   double GetArea() {
      mutex.lock();
      double area = 0.5 * c * r;
      mutex.unlock();
      return area;
   }
};

The problem arises if thread A is doing this:

circle.SetRadius(r1);
circle.UpdateCircumference();

while thread B is doing this:

area = circle.GetArea();

The question you might ask the unfortunate programmer during a code review (you have those, right?) is: “What if thread A is pre-empted between SetRadius and UpdateCircumference and thread B jumps in and executes GetArea?”. That’s the interleaving that exposes the atomicity violation bug in this code.

Code reviews don’t catch all bugs, so how about some defensive programming? Suppose the programmer adds an assertion, just in case:

area = circle.GetArea();
assert(area == PI * circle.GetRadius() * circle.GetRadius());

Will this help in uncovering the bug? (By the way, did you notice that this assertion has a concurrency bug of its own?)
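Here is what such a test might look like in code. This is only a sketch: it assumes the Circle class above (together with the PI constant and the GetRadius accessor), and the function name, loop counts, and tolerance are arbitrary. Note that GetArea and GetRadius are separate critical sections, so the check below has its own race, just like the assertion above.

#include <cassert>
#include <cmath>
#include <thread>

void StressCircle() {
   Circle circle(1.0);

   std::thread a([&] {                    // thread A: the non-atomic update
      for (int i = 0; i < 1000000; ++i) {
         circle.SetRadius((i % 10) + 1);
         circle.UpdateCircumference();    // a separate critical section
      }
   });

   std::thread b([&] {                    // thread B: the reader
      for (int i = 0; i < 1000000; ++i) {
         double area = circle.GetArea();
         double r = circle.GetRadius();   // yet another critical section
         // Mirrors the assertion in the text, with a tolerance to avoid
         // spurious floating-point mismatches. It can fire because of the
         // atomicity violation in Circle, or because r changed between
         // GetArea and GetRadius.
         assert(std::fabs(area - PI * r * r) < 1e-6 * r * r);
      }
   });

   a.join();
   b.join();
}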

Back-of-the-Envelope Calculation

Let’s assume that, during normal testing on a multicore machine, the non-atomic update in thread A executes once a second, and so does the GetArea call in thread B.

We’ll hit the bug if thread B executes its mutex.lock() in GetArea after thread A takes the lock in SetRadius and before it re-takes the lock in UpdateCircumference. That’s the window of vulnerability.

If thread B starts its call within the vulnerability window, it will read the new value of r and the old value of c. Notice that while thread A holds the lock, thread B is spinning, waiting for thread A to release the lock.

Most of the time in this window is spent acquiring and releasing the lock, which on average should take about 1000 processor cycles. On a 1GHz core that’s about 10⁻⁶ s.

In our example, we would have to run the test continuously for about 10⁶ s in order to hit the bug. That’s about 10 days of testing. Of course your mileage may vary, but you see what the problem is. Concurrency bugs are very hard to reproduce.

Crash Early and Crash Often

A tool that could accelerate the detection of concurrency bugs would save the developers and testers a lot of time and effort. That’s what Corensic’s Jinx does. The basic idea is simple: Run multiple simulations of the execution of code fragments in parallel, carefully picking the candidate interleavings to increase the chances of hitting the vulnerability window.

The enabling technology in this case is virtualization. Jinx is able to periodically virtualize the machine for short periods of time in order to run its simulations.

During a simulation round, Jinx runs several exploratory executions in virtual space, picks the most promising one, and runs it in real space. In this case, Jinx found an execution that leads to a crash.

Here’s what Jinx does when it takes control of the machine:

  1. It stops the running program and takes a snapshot of its state.
  2. It runs about a millisecond-worth of execution of the program in virtual space and records all memory accesses.
  3. It explores several alternative executions of the same fragment in virtual space. It ranks them and picks (“retires” in our jargon) the one that either crashes the program or is most likely to manifest a concurrency bug.

In the picture above, Jinx took a snapshot at point A and ran several simulations, one of them resulting in a (virtual) crash at point B. It then retired this particular simulation and ran it in real space, crashing the program at point C.

Of course, the most important part of the algorithm is the choice of executions for simulations. Random rescheduling of the threads is as unlikely to catch the bug as is regular testing. Exhaustive exploration of all possible interleavings would take astronomical time.

Smart Exploration

Imagine that you film the activities of all processors while executing a multithreaded program. Each frame corresponds to a single processor instruction. You’d end up with N tapes from N cores. Your task is to splice those tapes into one continuous movie. You can cut the tapes at arbitrary locations and interleave the fragments. Because of combinatorial explosion, the number of possible movies is overwhelming. That’s how many possible executions there are in a concurrent program. But Jinx has some additional information that cuts down this combinatorial space to a much more reasonable size.

As I mentioned before, during the first exploratory execution of a program fragment Jinx records all memory accesses. Of particular interest to us are accesses to shared memory — we recognize those when a thread touches a memory location that’s already been touched by another thread. Those shared memory accesses are our cue points — that’s where we cut the tape.

The idea is that the fragments between shared-memory accesses have no dependency on what other threads are doing, so any interleaving between them will have no impact on program behavior. What matters is the order in which the shared memory accesses happen, and their re-arrangements may produce distinct observable behaviors.

Since in a typical concurrent program there aren’t that many shared variables, the number of possible rearrangements is quite manageable.

Let’s go back to our example and study it in more detail. The simplest implementation of a mutex is based on one shared memory location, let’s call it l, and access to it goes through some version of the exchange instruction (for instance LOCK CMPXCHG), which we will simply call xchg. From the point of view of Jinx, both lock and unlock look like xchg l. There are two more shared variables, circle.r and circle.c.
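To make that concrete, here is a minimal exchange-based spinlock sketch in C++. It is not the implementation of std::mutex or of the lock in our example; it only illustrates why, at the instruction level, both lock and unlock show up as exchange operations on the same shared location l.

#include <atomic>

class SpinLock {
   std::atomic<int> l{0};   // 0 = free, 1 = taken; the shared location "l"
public:
   void lock() {
      // On the x86 this compiles to a LOCK'd exchange on l.
      while (l.exchange(1, std::memory_order_acquire) != 0) {
         // spin: somebody else holds the lock
      }
   }
   void unlock() {
      // Written as another exchange on l to match the description above;
      // a plain release store would also work here.
      l.exchange(0, std::memory_order_release);
   }
};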

Suppose that, during the initial exploration, threads A and B executed their respective accesses to Circle. As I explained earlier, it’s very unlikely that thread B would hit the vulnerability window, so the execution proceeded without conflict and Jinx recorded all accesses to shared variables.

Initial exploration: Thread B makes its call to GetArea far outside the vulnerability window. Only accesses to shared memory are represented in this diagram.

Let’s concentrate on the shared variable l used by the mutex. Thread A accessed l four times, so there are four distinct locations to which thread B’s first access to l could be moved (this is done by selectively delaying some threads). Jinx might therefore run four possible simulations involving l. Here’s the first one:

Simulation 1: Thread B takes the lock before thread A's first access to l. Notice that thread A has to spin (the xchg instruction is repeated three times). It has to wait for thread B to release the lock. Thread B correctly calculates the old area of the circle.

Notice an interesting fact: The simulated execution is not just a simple splicing of two recorded tapes. Unlike in the original execution, when thread A tries to enter the critical section it is blocked by thread B and has to spin. A new choice of interleaving often leads to a different execution.

The second and third simulations result in the bug manifesting itself.

Simulation 2: Thread B accesses l between thread A's first and second access to l. This time thread B has to spin until thread A releases the lock. After that, thread A has to spin until thread B releases the lock. This simulation hits the bug -- thread B sees the new value of r and the old value of c.

Simulation 3: Thread B accesses l between thread A's second and third access to l. This simulation also hits the bug.

The fourth simulation misses the vulnerability window.

Simulation 4: Thread B accesses l between thread A's third and fourth access to l. Thread B has to spin, but the bug is not hit -- the new area of the circle is correctly calculated.

This is not bad. Suddenly the likelihood of producing buggy behavior shot up substantially.

Let’s do another back-of-the-envelope calculation. Assume that during each simulation round Jinx samples about 1ms of program execution and works on it for roughly 100ms. In order not to slow the program down too much, Jinx may perform, say, 5 simulation rounds per second.

The probability that both accesses from our example will fall within the 1ms sampling period is 10⁻³. Jinx can perform the 10³ simulation rounds necessary to reproduce the bug in 200 seconds, or slightly more than 3 minutes. Compare this with the 10 days of unassisted testing we got previously.

What Happens Next?

Suppose that, after running for about 3 minutes, Jinx found a simulation that manifested a concurrency bug. The question is: What are the consequences of this bug? If the bug causes an assertion to fire, or results in an access fault within the same 1ms simulation period, Jinx will immediately re-run the same simulation in real time and cause the program to crash, thus proving beyond reasonable doubt that there was indeed a bug in the program.

But even if Jinx cannot trip the program, it will still rate different simulations according to its own heuristics, and it will run the one that scores the highest. In our example, even if the five simulations (including the original run) are rated the same, Jinx will pick one of them with the probability 1/5. Since two of them expose the bug, Jinx should be able to hit one in less than 10 minutes. Although the program won’t crash immediately, it will be running with corrupt data, which will eventually be discovered.

Conclusion

A dynamic concurrency debugger works by rearranging thread interleavings. In general, this method leads to an exponential explosion. Jinx uses two innovative techniques to make this approach feasible: It runs multiple simulations in parallel using virtual machine technology, and it rearranges thread timings based on shared-memory communications. The result is a concurrency bug accelerator that can speed testing up by orders of magnitude while running a program at almost normal speed.

Recently we’ve had yet another discussion at work about the definition of data race. You’d think that being in the business of detecting data races, we should know what a data race is. Unfortunately, it’s not so simple. Defining a data race is one of the harder problems faced by the designers of memory models (and concurrency bug checkers). Java had it figured out some time ago, and now C++11 has its own definition. But our product, Jinx, is language agnostic–it catches data races at the processor level.

The problem is that there is no one-to-one mapping between data races as defined by a higher-level language and the object code produced by the compiler. Consider a lock-free algorithm written in C++ using weak atomics (e.g., using memory_order_relaxed operations on atomic variables). When compiled for the x86 processor, the assembly instructions might look identical to those generated without atomics. But from the point of view of C++, the first version is race-free and the second is full of races. In this case the role of atomics is reduced to being nothing more than programmer’s annotation that says, “I know what I’m doing.” (And also, “This code is portable.”)

On the other hand, neither is there one-to-one correspondence between data races, as defined by a higher-level language, and concurrency bugs. Sometimes looking at how the code is actually executed on a multicore processor may flag concurrency errors that don’t manifest themselves as data races at the source level. And then there are programs that mix high-level languages and inline assembly, for which high-level analysis is impossible.

The purpose of this blog post is to gain some insight into data races at the processor level. I’ll start with the description of the x86 memory model and a definition of data race. I’ll explain the relationship between data races and sequential consistency. Then I’ll show you an example of a Linux spinlock, which contains a deliberate data race. This race can be classified as a non-triangular data race. I’ll explain what a triangular data race is and why it might be important in analyzing program execution.

The x86 Memory Model

In order to understand data races at the lowest level you have to dig into a processor’s memory model. For the x86, both Intel and AMD produced documents that provide some guidance, but fall short of defining a formal model. It took a lot of research and second-guessing before academics and Intel engineers agreed on one particular model, called Total Store Order (TSO).

The gist of this model is that, at least conceptually, all writes to memory from a particular core go through this core’s write buffer, which is invisible to other cores. The contents of this buffer find their way to global memory in FIFO order at some random intervals or as a result of special instructions. These special instructions are either LOCK’d instructions (the ones with the LOCK prefix) or fences (actually, only MFENCE is relevant here, because the model covers regular write-back memory, as opposed to write-combining memory or non-temporal instructions). Those instructions flush the buffer of the core that is executing them. It’s pretty surprising that such a simple model explains all the strange behaviors of the x86 processor with respect to memory loads and stores.

Total Store Order processor memory model

A few more details: Several x86 instructions are modeled as a series of actions (see our earlier blog entry). In particular, a LOCK’d instruction can be split into smaller constituents:

  1. LOCK, which flushes the buffer and locks access to global memory for all cores,
  2. loads, stores, and arithmetic actions specific to a particular instruction,
  3. UNLOCK, which flushes the buffer and unlocks access to global memory.

For instance, the execution of:

LOCK XCHG [x], EAX

may be modeled by:

  1. flushing the local write buffer,
  2. acquiring the lock on the bus,
  3. loading x from memory to a temporary register,
  4. writing EAX to x (through the local write buffer),
  5. copying the value from the temporary to EAX,
  6. flushing the write buffer (thus forcing the new value of x to be written to global memory), and
  7. releasing the bus lock.

Keep in mind that this is an idealized model, not necessarily corresponding to actual architecture or microcode.
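If it helps, here is a toy, single-threaded C++ model of one core’s store buffer. It is only a sketch of the rules above (FIFO buffering, store forwarding, and flushing on fences and LOCK’d instructions), not a model of real hardware or of how Jinx works.

#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

struct TsoCore {
   std::unordered_map<uint64_t, int>& globalMemory;   // shared by all cores
   std::deque<std::pair<uint64_t, int>> writeBuffer;  // this core's private FIFO buffer

   void store(uint64_t addr, int value) {
      writeBuffer.push_back({addr, value});           // writes are buffered in FIFO order
   }

   int load(uint64_t addr) {
      // The writing core sees its own buffered stores (store forwarding)...
      for (auto it = writeBuffer.rbegin(); it != writeBuffer.rend(); ++it)
         if (it->first == addr) return it->second;
      return globalMemory[addr];                      // ...other cores see only global memory
   }

   void flush() {                                     // MFENCE, or part of a LOCK'd instruction
      for (auto& w : writeBuffer) globalMemory[w.first] = w.second;
      writeBuffer.clear();
   }

   int lockedExchange(uint64_t addr, int value) {     // models LOCK XCHG [addr], value
      flush();                                        // LOCK: drain the buffer, take the bus
      int old = load(addr);                           // load into a temporary
      store(addr, value);                             // write goes through the buffer
      flush();                                        // UNLOCK: force the new value to global memory
      return old;
   }
};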

x86 Data Races

Let’s start with the simplest definition of a data race and build things up from there. Imagine that you are single-stepping through a program that has multiple threads running on multiple cores. At each step you pick one core and execute the next instruction in its thread. For complex (but not LOCK’d) instructions, like INC[x], you might execute just one of its actions, e.g., the load of [x], at a time. This kind of step-by-step execution is called sequentially consistent, as long as all stores go directly to global memory.

A data race occurs if you can pick two actions accessing the same memory location on two separate cores to be executed next to each other, and at least one of them is a store. Notice that if these two actions can be made adjacent in a global execution order, they can also be executed in the reverse order. Now, if both actions were just memory loads, the order wouldn’t matter. This is the reflection of a more general principle that you can never have data races on immutable data. But if one of the accesses is a store, the order does matter.
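Spelled out as code, the check is almost trivial. The following is just a transcription of that definition into a predicate over two recorded memory actions; the MemAccess fields are hypothetical, not Jinx’s actual data structures, and the narrowing that excludes LOCK’d instructions comes later in this post.

#include <cstdint>

struct MemAccess {
   int      core;      // which core executed the action
   uint64_t address;   // the memory location it touched
   bool     isStore;   // store or load
};

// True if the two actions, brought next to each other in a sequentially
// consistent execution, constitute a data race.
bool IsDataRace(const MemAccess& a, const MemAccess& b) {
   return a.core != b.core
       && a.address == b.address
       && (a.isStore || b.isStore);   // at least one of them is a store
}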

This definition of data race is the foundation of dynamic race detectors, including Jinx. Jinx virtually (that is, using a virtual machine technology) re-runs parts of your program in a sequentially consistent manner, rearranging the threads in order to, among other things, bring shared memory accesses next to each other. If the program is properly synchronized, this should be impossible. For instance, one of the threads will be blocked waiting for a lock (this time I’m talking about higher-level locks, or critical sections) while the other is accessing a shared variable under the same lock.

Of course, the next question is about the locks themselves: They are implemented using the same instructions as the rest of the code, and use shared variables, so how come they don’t generate their own data races? Some race detectors recognize a few predefined patterns of instructions that correspond to locks in a particular language, library, or operating system. Again, Jinx tries to be agnostic.

Maybe unsurprisingly, higher level locks are implemented using LOCK’d instructions. So if we narrow the definition of data races to exclude conflicts between locked instructions, we should be fine. But are we allowed to play with definitions? How do we decide what is and what isn’t a data race? Are there some general principles to give us guidance? There is one I mentioned before called…

Sequential Consistency

Ideally, we would like to define data races so that they are always concurrency bugs. This isn’t easy, because we don’t have a definition of a bug. In a higher-level language, like C++, it’s enough to say that a data race leads to undefined behavior. But at a processor level, everything is defined. A processor won’t suddenly stop and say, “I don’t know what to do with this program.”

However, higher level languages give us a hint in terms of the DRF guarantee. DRF stands for Data-Race Freedom, and the guarantee is that if your program is free of data races, it is sequentially consistent (SC). Mind you, an SC program might still be buggy, and even have concurrency bugs, but at least we know how to reason about such programs. We run them in our minds, one action at a time, switching between threads at will, assuming that all writes go directly to main memory and are immediately visible to other threads. We know that this is not what really happens, but if a program is DRF, this is how it behaves.

Not all languages offer the DRF guarantee; C++ supports weak atomics that break SC without introducing data races. Nevertheless, DRF is a good goal.

What does it all mean for the x86? The definition of SC stands at the assembly level, although we have to allow for splitting complex instructions. For instance, INC[x] could be split into three actions: a load, an increment, and a store. If this instruction doesn’t have a LOCK prefix, another thread is free to execute its own actions in between the actions of the current instruction.

A TSO processor, like the x86, would be almost SC, if it weren’t for those pesky write buffers. Still, many memory-access conflicts on an x86 don’t break sequential consistency. For instance, because of the FIFO nature of the write buffers, two conflicting stores don’t break SC, and therefore are not considered a data race. Think of it this way: Inverting two adjacent stores from different threads doesn’t change anything because both stores go to separate write buffers. The external visibility of those stores depends on the order in which the buffers are discharged into global memory, and they are discharged in a sequentially consistent way.

So the only races on the x86 are those between loads and stores, except when both are LOCK’d (in which case they flush their buffers). That’s a good starting point, but a practical question remains: If we report all of those events as data races, how many of them will be considered false positives in real life? Unfortunately, the answer is: too many. It turns out that an efficient implementation of higher-level locks often contains deliberate races. It’s true that the acquisition of a lock usually starts with a LOCK’d instruction – often a CMPXCHG, also known as CAS – but the corresponding release might be implemented using an unsynchronized store. This is why the Jinx definition of a data race excludes conflicts in which even one of the instructions is LOCK’d.

Do we miss some legitimate data races because of that? Perhaps, but we haven’t seen such examples in practice yet. Interestingly, even this narrowed definition of data races flags some legitimate lock-free algorithms, in particular the Linux ultra-fast implementation of a spinlock.

Spinlock

Here’s an actual report produced by Jinx on Linux (Jinx can be used to debug the kernel):

Event: Data Race (3 times)
* Stack A
#0 0xffffffff81036d44 in __ticket_spin_lock at spinlock.h:65
#1 0xffffffff8159e7ee in arch_spin_lock at paravirt.h:744
#2 0xffffffff810909cd in futex_wake at futex.c:891
#3 0xffffffff81093288 in do_futex at futex.c:2602
#4 0xffffffff810933fb in sys_futex at futex.c:2636
* Stack B
#0 0xffffffff81036d74 in __ticket_spin_unlock at spinlock.h:101
#1 0xffffffff81090a58 in arch_spin_unlock at paravirt.h:760
#2 0xffffffff81093288 in do_futex at futex.c:2602
#3 0xffffffff810933fb in sys_futex at futex.c:2636

Two threads, A and B, ran into a race in spinlock.h. One access occurred at line 65 and the other at line 101 of that file. The two stack traces show that the caller, in both cases, was a futex, or a “fast mutex”– a Linux implementation of a thin lock.

Let’s have a look at spinlock.h. Line 65 is inside the function __ticket_spin_lock, which contains a bunch of inline assembly, which I’m not going to analyze here (although I will analyze similar code shortly):

 61 static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 62 {
 63     short inc = 0x0100;
 64
 65     asm volatile (
 66         LOCK_PREFIX "xaddw %w0, %1\n"
 67         "1:\t"
 68         "cmpb %h0, %b0\n\t"
 69         "je 2f\n\t"
 70         "rep ; nop\n\t"
 71         "movb %1, %b0\n\t"
 72         /* don't need lfence here, because loads are in-order */
 73         "jmp 1b\n"
 74         "2:"
 75         : "+Q" (inc), "+m" (lock->slock)
 76         :
 77         : "memory", "cc");
 78 }

The thing to notice is that this is the acquisition of a (ticketed) spinlock, the first instruction of which is indeed LOCK’d. Since Jinx doesn’t flag races with LOCK’d instructions, it must be the next instruction that accesses memory, movb, which is involved in the race. It reads a byte-sized value from memory without using a LOCK.

The other racing line, 101, is inside __ticket_spin_unlock:

 99 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
100 {
101     asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
102          : "+m" (lock->slock)
103          :
104          : "memory", "cc");
105 }

This is the release part of the ticketed spinlock, and it contains an increment instruction that accesses memory (the same lock->slock variable as in the previous function). The macro UNLOCK_LOCK_PREFIX expands to nothing on processors other than the Pentium Pro (which had a bug). So we do have an unprotected store (part of incb) that races with an unprotected load (movb in the previous function). And yet, the code is correct. But to understand that, we need to learn about a more sophisticated notion of…

Triangular Data Races

Let’s go back to the x86 TSO memory model. The cores write into their private FIFO buffers. Data moves from these buffers into main memory. At any point in time only one buffer may discharge a previous write entry into memory, so those writes end up interleaved with each other. What’s important is that all processors see those writes appearing in main memory in the same order — the Total Store Order. (To be precise, the processor doing the write will see it immediately due to intra-processor forwarding, but that doesn’t matter in our case.)

Now imagine that you have an actual execution on the x86 and you want to show that it’s equivalent to an SC execution. You could do it, for instance, by rearranging instructions so that stores are moved up to the point where they were originally discharged into global memory. In order to preserve program order on individual processors, you would also have to move up the loads (and other instructions) that appear between those stores. But if you move a load forward too much, it might see a different store than it did in the original execution, and the new SC execution will not be equivalent to the original one.

A typical situation is when stores and loads cross each other, as in this program (x and y originally zero):

P1                  P2
mov [x], 1          mov [y], 1
mov eax, [y]        mov eax, [x]

If stores from both processors still sit in their respective buffers while the loads are performed, both eax registers will end up with zeros. If we try to move the stores forward to the point where the buffered writes were flushed, we’ll also have to push the loads, and they will end up reading ones instead of zeros. This is clearly a non-SC execution and the races it contains are definitely of interest to us.
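Here is the same store-buffering test written as a C++ sketch. Relaxed atomics are used so that the accesses stay well defined in C++ while the compiler emits plain MOVs; on a multicore x86 the non-SC outcome usually shows up after some number of iterations, though how quickly depends on the machine and the compiler.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void p1() {
   x.store(1, std::memory_order_relaxed);    // mov [x], 1
   r1 = y.load(std::memory_order_relaxed);   // mov eax, [y]
}

void p2() {
   y.store(1, std::memory_order_relaxed);    // mov [y], 1
   r2 = x.load(std::memory_order_relaxed);   // mov eax, [x]
}

int main() {
   for (int i = 0; i < 1000000; ++i) {
      x = 0; y = 0; r1 = r2 = 1;
      std::thread a(p1), b(p2);
      a.join(); b.join();
      if (r1 == 0 && r2 == 0) {
         std::printf("non-SC outcome after %d iterations\n", i);
         return 0;
      }
   }
   std::printf("no non-SC outcome observed\n");
   return 0;
}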

Rearranging stores to create an equivalent execution with no buffering. Because program order must be preserved, the loads are rearranged too, resulting in different values being read. The SC execution is not equivalent to the original one. (Time flows from left to right.)

It turns out that all violations of SC on the x86 have one thing in common–a triangular data race. This is the triangular pattern (we have two of those in our example):

P1                  P2                  Comment
mov [x], 1                              Preceding store
                                        No lock or fence
mov eax, [y]        mov [y], 1          The race between a load and a store

More generally, a triangular data race occurs between a load action on one processor and a store action on another if the load is preceded by some other store. In the case above we have a race on y, and what makes it triangular is the odd store to x. The key is that this store may still sit in the write buffer when the racing load is made. If, on the other hand, there is any fence or LOCK/UNLOCK between the odd store and the racing load, we are fine: the race is not triangular.

What’s amazing is that it can be proven that any x86 program that doesn’t contain triangular data races is sequentially consistent. (For a formal definition of a triangular data race and a proof of the theorem, see Owens.) Armed with this knowledge, let’s revisit the spinlock example.

Spinlock and Triangular Races

The Linux ticketed spinlock is an elaboration on a basic spinlock. Here’s the (more readable) Intel assembly code for that simplified spinlock. There is a variable, lck, on which to spin, and its initial value is 1, corresponding to the lock being unclaimed. The address of lck is stored in eax.

Label     Instruction      Comment
acquire:  lock dec [eax]   atomic {
                             tmp = [lck] - 1
                             [lck] = tmp
                             flag = (tmp >= 0)
                             flush local write buffer }
          jns enter        if (flag) goto enter
spin:     cmp [eax], 0     flag = ([lck] <= 0)
          jle spin         if (flag) goto spin
          jmp acquire      goto acquire
enter:                     the critical section starts here

To acquire the lock, we first atomically decrement lck. If the lock was free, the decremented value should be non-negative (actually, zero). If it isn’t, we start spinning, waiting for lck to become positive, and then try to acquire it again.

To release the lock, we set lck back to 1:

release:  mov [eax], 1     [lck] = 1

Notice that the load part of the instruction cmp [eax], 0 is in a data race with the store mov [eax], 1. This is the race that is detected by Jinx.

But does this race break sequential consistency? In order to break SC, the race would have to be triangular, which means that the racing load would have to be preceded by some other store that’s not flushed. But the last store before entering the spin loop is part of the LOCK’d instruction, lock dec [eax], which flushes the write buffer. Since there is no triangular race, the spinlock is indeed sequentially consistent.

Conclusion

It’s eye opening to realize how much wiggle room there is in the definition of data race. We have the informal definition: Two simultaneous accesses to the same memory location, one of them being a write. Except that now we have to define “simultaneous,” which is a bit tricky. Simultaneity is characterized by lack of ordering–two events are simultaneous if it’s impossible to tell which came before the other.

Events on different processors may be ordered through synchronizing actions. At the lowest level, these are LOCK’d instructions and fences. However, there is no simple mapping between high-level-language definitions of data races and what is observed at the processor level during execution.

Ultimately, the reason for defining data races is to be able to identify one type of concurrency bug. But bug detection is in general undecidable, so the best we can do is to define behaviors that are most likely related to programmers’ errors. Most programmers reason about concurrent programs using the sequentially consistent model of execution. This is why the most insidious concurrency bugs are related to the breakdown of sequential consistency. There is a linkage between data races and (the lack of) sequential consistency, which is expressed in terms of the DRF guarantee: A program with no data races behaves in a sequentially consistent way.

We can reverse the reasoning and define data races in a way that would guarantee the DRF property, and this is what is usually done. Seen from that perspective, it’s possible to strengthen the DRF guarantee on TSO processors. The new TRF (Triangular Race Freedom) guarantee states that a program with no triangular data races behaves in a sequentially consistent way and vice versa.

We may reasonably assume that non-triangular data races are not bugs but rather were deliberately coded, either directly by programmers (as was the case with the Linux spinlock), or by a compiler translating a sequentially consistent high-level program. The jury is still out, though, because outside of lock-free programming any race, triangular or not, is still very likely to be a bug.

Acknowledgments

I’d like to thank Anthony Williams, Dmitriy V’yukov, and my co-workers for valuable feedback.

Bibliography

  1. Scott Owens, Susmit Sarkar, Peter Sewell, A better x86 memory model: x86-TSO
  2. Scott Owens, Reasoning about the Implementation of Concurrency Abstractions on x86-TSO

An acquaintance recently observed that his company was concerned more with correctness bugs than with crashing bugs. For them, correctness bugs are more painful to debug than crashing bugs and they also get worse when you make the leap to shared-memory parallel programming (of which threaded programming is one instance). This is a pretty typical story, and it’s worth considering why correctness bugs are so problematic. Of particular interest is why shared-memory parallel programming makes them so much worse.

Corruption is worse than crashing

If a program stops and complains, that’s one thing: an automated system or a human can just start it again. There might be real-time or reliability goals for the program that aren’t being met, but at least you know something went wrong. If it keeps on going but corrupts the output in some way, there’s a much bigger problem. Most software is based on an implicit assumption of the fail-stop model. This just means that errors in the code stop the program’s forward progress before erroneous data can become nonvolatile: written to disk or communicated to another process. You can see examples of applications that make this assumption all the time, when they’ve trashed their registry entries or made their tables inconsistent, and consequently can’t run any more. At least one study finds that 7% of bugs aren’t fail-stop bugs; that is, they produce bad outputs rather than crashing or stopping.

Correctness bugs are hard to fix, hard to detect, and costly

In my acquaintance’s domain, correctness bugs result in subtly wrong graphics. There, the fault tends to get detected, but the error-resolution process is lengthy, as much more of the codebase is suspect. These bugs are detected later in the development process, because earlier stages lack sufficient verification to detect corruption, but crashes are always obvious. In other domains, correctness bugs could mean incorrect chances of success when drilling a test well, an incorrect diagnosis, a wrongly sequenced chromosome, a corrupted database, poor search results, or an ill-advised financial trade. In some of the worst cases, humans or machines take the data out of computation and start committing real-world resources to the results of the computation. Things get really expensive when errors in computation get into nonvolatile storage inside people’s brains.

Some of these bugs are eventually detected, as when you compute an incorrect launch window for a rocket. Others, like choosing not to drill a test well in a fruitful spot, will likely never be detected. The hidden and obvious economic drags of correctness bugs include unhappy customers, missed opportunities, and wasted efforts.

Correctness bugs are more horrible as your software gets more non-deterministic

These bugs get more insidious when you make the leap to shared-memory parallel programming. Whereas in the single-threaded, synchronous world, a given input to a program always yielded the same output, you’re now faced with programs that may give a correct output only a certain percentage of the time. You expect 100%, but it’s not generally possible to prove that a shared-memory parallel program produces deterministic results. This causes trouble for crashing bugs because it makes repros rarer, and allows bugs to creep further through the development process before detection. It also renders debuggers less useful… You can’t just step through until you hit the bug, because sometimes the bug happens, and sometimes it doesn’t, and often the debugger stops it from happening.

The problem is much worse in the case of correctness bugs: both correctness and non-deterministic bugs tend to get detected late in development, and combined, they appear very late. Users of your code can act upon the results because they’re not crashing bugs, and the bug can slip through your test because it’s non-deterministic, even if you have a test case that could have caught the bug! Input coverage isn’t enough any more. How do we deal with this problem?

Dealing with the Problem

Rough classification of programming bugs. Adding assertions to your program can turn correctness bugs into crashing bugs. Limiting shared mutable state reduces non-determinism.

The forces of non-determinism and corruption!

There are three big parts to the answer to this problem. They don’t make the problem go away. They just stop your life from becoming a hell of customer escalations and expensive round-trips to QA.

Choose a simple parallel programming model

First, the way you use multiple cores will change the amount of non-determinism in your software. All the parts of your program that can avoid shared-mutable memory should probably avoid it. For most applications, you’ll never eliminate non-determinism altogether. After all, what is input if not an access of shared-mutable state? But, you can try to guarantee that like inputs yield like outputs by being careful about the sorts of shared-memory parallel programming you do.

Turn your correctness bugs into crashing bugs

Next, increase the likelihood of your bugs being crashing bugs rather than correctness bugs. The trusty assert() is your friend here. Ship your software with your asserts enabled, and don’t tolerate failing asserts. They tell you where the bug was and what it was. When you get that bug report from the field, you know exactly what you need to fix. Programs that need to continue to function in the face of failure can use software fault-tolerance approaches like running every computation twice and comparing results, with an optional tie-breaker. Some problems are intrinsically easy to verify, like factoring an integer. Most are not. During testing you can use late-stage output verification to achieve fail-stop behavior. Make sure everything does this. By doing this, you’ll turn more bugs into crashing bugs rather than correctness bugs.
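As a sketch of this point, here is one way to keep assertions alive in shipping builds: a hypothetical ALWAYS_ASSERT macro that ignores NDEBUG and fail-stops before corrupt data can escape. The macro name, the sample computation, and the abort-on-failure policy are illustrative, not a prescription.

#include <cstdio>
#include <cstdlib>

#define ALWAYS_ASSERT(cond)                                        \
   do {                                                            \
      if (!(cond)) {                                               \
         std::fprintf(stderr, "assertion failed: %s (%s:%d)\n",    \
                      #cond, __FILE__, __LINE__);                  \
         std::abort();  /* fail-stop before bad data escapes */    \
      }                                                            \
   } while (0)

double ComputeArea(double r) { return 3.14159265358979 * r * r; }

double CheckedArea(double r) {
   double area = ComputeArea(r);
   ALWAYS_ASSERT(area >= 0.0);   // a correctness bug now becomes a crashing bug
   return area;
}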

Amplify your testing

Lastly, amplify your non-deterministic bug-find rate. If you expect to deploy software that will run at a rate of tens of thousands of machine-hours per hour, that is, you have tens of thousands of users, or a few users using thousands of computers, you won’t match their ability to find non-deterministic bugs in software without amplifying your bug-find rate. This is a potential success disaster: shipping lots of software will make you realize how buggy it is.

Our goal with Jinx is to let the developer use one computer to test as thoroughly as thousands would, and to let QA engineers make one test pass count for a thousand. To learn more about what we’re doing with Jinx, check out our website!

Memory consistency models have an almost mythical aura. They can puzzle the most experienced programmers and lead to bugs that are incredibly hard to understand and fix. If you have written multithreaded code, it is likely that you have stumbled upon memory model woes. Chances are that you have also lost bets with your colleagues because of memory consistency model disputes. In this blog post I will discuss some of the rationale of why memory models were created and give some specific examples of how that affects you.

Memory Consistency Models

First things first, let’s define what a memory model is. A memory consistency model defines what values a given read operation may return. The simplest memory model is the Sequential Consistency model, in which each execution behaves as if there were a single global sequence of memory operations, and the operations of a given thread appeared to all threads in the same order as they appear in the program (program order). It is the most natural model for normal humans to think about, because the execution behaves as if it were run on a multitasking uniprocessor.

Consider the following example:

Two processors communicating through shared memory

The question is, what value can the read of data in P2 return? The most obvious answer here is 42. But what would you say if P2 observed the writes to data and flag in the opposite order? P2 could actually read data as “0”, which is surprising and not allowed by the sequential consistency memory model. And yet…

The main problem with sequential consistency is that it prevents systems from reordering memory operations to hide long latency operations and improve performance. For example, when a cache miss is being serviced, the processor could execute another memory access that comes after it in program order. That access may hit the cache and therefore complete earlier than the delayed access. Abandoning sequential consistency at the processor level results in dramatically improved performance.

But processors are not the only source of memory operation reordering. Many compiler optimizations effectively reorder code, e.g., loop-invariant code motion, common sub-expression elimination, etc. Furthermore, memory models of languages and memory models of the hardware they run on need not be the same. This is why compilers and synchronization libraries need to insert fences in the generated code. They have to map the language memory model to the hardware model. For example, Java and C++11 (the upcoming C++ standard) support memory models that guarantee sequential consistency for programs free of data races (although C++ also offers ways of relaxing sequential consistency without introducing races–the so called weak atomics).

Due to the difficulty of improving performance under sequential consistency, a variety of “relaxed” memory models were conceived. For example, in the Weak Ordering memory model, there is no guarantee that a processor will observe another processor’s memory operations in program order. This is where a “memory fence” (a.k.a. “memory barrier”) comes into play. When a fence instruction is executed, it guarantees that all memory operations prior to it in program order are completed (and visible to other processors) before any operation after the fence in program order is allowed to proceed. You would be bored and stop reading if I described the multitude of consistency models in this post. However, I do encourage you to read more about memory models in this very nice tutorial by Sarita Adve and Kourosh Gharachorloo. Also, Paul McKenney’s paper has a nice table summarizing the ordering relaxations in modern microprocessors.
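To make the fence concrete, here is the data/flag hand-off from the earlier example written in C++ with explicit fences. This is a sketch: std::atomic with relaxed operations is used only to keep the accesses well defined in C++, and it is the pair of fences that guarantees P2 reads 42.

#include <atomic>
#include <cassert>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void producer() {                                         // P1
   data.store(42, std::memory_order_relaxed);
   std::atomic_thread_fence(std::memory_order_release);   // earlier writes complete first
   flag.store(1, std::memory_order_relaxed);
}

void consumer() {                                         // P2
   while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
   std::atomic_thread_fence(std::memory_order_acquire);   // later reads cannot move above the flag check
   assert(data.load(std::memory_order_relaxed) == 42);    // guaranteed to see 42
}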

The x86 Memory Model

Now let’s talk about some of the highlights of the x86 memory model. A big disclaimer first: this can change, and probably does change, between models, so it is always a good idea to check the manuals before venturing into sensitive code (8-8 Vol. 3 in this manual for Intel and Section 7.2 in this manual for AMD).

In a nutshell, recent implementations of the x86 ISA (P6 and on) follow, roughly, what is normally termed total-store order (TSO), which is stronger than “processor consistency” (the model people often assume the x86 follows). Its key ordering properties are:

  1. reads are not reordered with respect to reads;
  2. writes are not reordered with respect to reads that come earlier in the program order;
  3. writes are not reordered with respect to most writes (excluding, e.g., multiple writes implicit in string operations like REP MOVSB);
  4. reads may be reordered with respect to writes that come earlier in program order as long as those writes are to a different memory location;
  5. reads are not reordered with respect to I/O instructions, locked instructions and other serializing instructions.

There are no guarantees whatsoever of ordering between writes of different processors; the outcome of concurrent writes to the same memory location is non-deterministic. Increment instructions have no atomicity guarantees; moreover, even some write operations that update multiple bytes are not guaranteed to be atomic (see Andy’s blog post). For example, if a write operation to multiple bytes happens to cross a cache line boundary, the operation is not guaranteed to be atomic.
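As a small illustration of the last point, here is a sketch contrasting an increment composed of a separate load and store (no atomicity, so concurrent increments can be lost) with a single atomic read-modify-write, which compiles to a LOCK-prefixed instruction on the x86.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> lossyCounter{0};    // incremented with a separate load and store
std::atomic<int> atomicCounter{0};   // incremented with a single LOCK'd read-modify-write

void worker() {
   for (int i = 0; i < 100000; ++i) {
      // Two threads can read the same value and both write back value + 1,
      // losing an increment.
      lossyCounter.store(lossyCounter.load(std::memory_order_relaxed) + 1,
                         std::memory_order_relaxed);
      // LOCK XADD on the x86; no increments are lost.
      atomicCounter.fetch_add(1, std::memory_order_relaxed);
   }
}

int main() {
   std::thread t1(worker), t2(worker);
   t1.join(); t2.join();
   // atomicCounter ends up exactly 200000; lossyCounter is usually smaller.
   std::printf("lossy: %d, atomic: %d\n", lossyCounter.load(), atomicCounter.load());
   return 0;
}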

Here is an example of how the x86 memory model can get you in trouble:

The outcome t1 == 0 and t2 == 0 is possible on the x86

An execution whose final state is t1 == 0 and t2 == 0 is allowed. Such an outcome is unintuitive and not sequentially consistent, because there is no serialized execution that leads to this state. In any serialized execution, there will be an assignment in one processor (A = 1 or B = 1) prior to a read in the other processor (t1 = B or t2 = A).

Another way to look at the problem is to try to build a happens-before graph of the execution. In this representation, each node is an executed instruction. A directed edge from instruction P to instruction Q is drawn if Q has observed the effects of P, and P has not observed the effects of Q, so P “happens-before” Q. (Note that, in the C++11 model, “happens-before” has a slightly different meaning than what is used in this post.)

Here is the happens-before graph for the example above when the outcome is t1 == 0 and t2 == 0:

An attempt to draw happens-before arrows for non-sequentially-consistent execution

Edge (1) exists because the read t1 = B in P1 did not observe the write B = 1 in P2. The same applies to edge (2). Edges (3) and (4) are there because of program order. Since there is a cycle in the happens-before graph, there is no serialized order that would satisfy the happens-before relationship. Therefore, the execution is not sequentially consistent. What happened in this example is that the read operation t1 = B in P1 proceeded before the write operation in A = 1 had a chance to complete and become visible to P2.

Here is another example of how the x86 memory model leads to surprising results:

The outcome t2 == 0 and t4 == 0 is possible on the x86

The snippet of execution above might lead to an unexpected state where t2 == t4 == 0. Let’s look at this from the perspective of P1. This may happen because the processor can quickly forward the value from the pending write to A (A = 1) to the read of A (t1 = A) on the same processor. In the meantime, the other processor, P2, can’t access this pending write (because it’s still waiting in P1’s private write buffer) and reads the old value (t4 = A). Note that this outcome cannot be explained by a simple reordering of t2 = B and t1 = A! Intuitive, huh?

One final example for you to noodle over. Consider a boiled-down version of Dekker’s mutual exclusion algorithm for two threads:

Unintuitive outcome of a boiled-down Dekker's algorithm: both threads enter the critical section

The gist of the algorithm is to use two flag variables, one for each processor, flag1 and flag2. P1 sets flag1 when it is attempting to enter the critical section, it then checks if flag2 is set; if it is not set, it means P2 has not attempted to enter the critical section, so P1 can safely enter it. Because the x86 memory model allows reordering of loads with respect to earlier stores, the read of flag2 can proceed before the setting of flag1 is completed, which can lead to both processors entering critical sections, since P2 might have just set flag2!
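Here is the boiled-down handshake expressed in C++, as a sketch rather than the full algorithm. With the default memory_order_seq_cst the store-load reordering described above is forbidden, so at most one of the two calls can return true; with the processor’s native ordering (or with relaxed atomics) both could.

#include <atomic>

std::atomic<int> flag1{0}, flag2{0};

bool TryEnter1() {                                      // runs on P1
   flag1.store(1, std::memory_order_seq_cst);           // announce intent
   return flag2.load(std::memory_order_seq_cst) == 0;   // enter only if P2 has not announced
}

bool TryEnter2() {                                      // runs on P2
   flag2.store(1, std::memory_order_seq_cst);
   return flag1.load(std::memory_order_seq_cst) == 0;
}

On the x86, compilers typically implement the sequentially consistent store with an MFENCE or an XCHG, which supplies exactly the store-to-load ordering that the plain assembly version lacks.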

That is it! I hope this helped you get a better grasp of what a memory consistency model is and understand a few of the key aspects of the x86 model. And, if you come across something that looks like a memory consistency bug, try building that happens-before graph to find cycles and remember to look at the manual 🙂 . Have fun!

Recently we’ve had a series of posts in this blog about so called benign data races that stirred a lot of controversy and led to numerous discussions here at Corensic. Two bastions formed, one claiming that no data race is benign, and the other claiming that data races are essential for performance. Then it turned out that we couldn’t even agree on the definition of a data race. In particular, the C++11 definition seemed to deviate from the established notions.

What Is a Data Race Anyway?

First of all, let’s make sure we know what we’re talking about. In current usage a data race is synonymous with a low-level data race, as opposed to a high-level race that involves either multiple memory locations, or multiple accesses per thread. Everybody agrees on the meaning of data conflict, which is multiple threads accessing the same memory location, at least one of them through a write. But a data conflict is not necessarily a data race. In order for it to become a race, one more condition must be true: the access has to be “simultaneous.”

Unfortunately, simultaneity is not a well defined term in concurrent systems. Leslie Lamport was the first to observe that a distributed system follows the rules of Special Relativity, with no independent notion of simultaneity, rather than those of Galilean Mechanics, with its absolute time. So, really, what defines a data race is up to your notion of simultaneity.

Maybe it’s easier to define what isn’t, rather than what is, simultaneous? Indeed, if we can tell which event happened before another event, we can be sure that they weren’t simultaneous. Hence the use of the famous “happened before” relationship in defining data races. In Special Relativity this kind of relationship is established by the exchange of messages, which can travel no faster than the speed of light. The act of sending a message always happens before the act of receiving the same message. In concurrent programming this kind of connection is made using synchronizing actions. Hence an alternative definition of a data race: A memory conflict without intervening synchronization.

The simplest examples of synchronizing actions are the taking and the releasing of a lock. Imagine two threads executing this code:

  mutex.lock();
  x = x + 1;
  mutex.unlock();

In any actual execution, accesses to the shared variable x from the two threads will be separated by a synchronization. The happens-before (HB) arrow will always go from one thread releasing the lock to the other thread acquiring it. For instance in:

#   Thread 1          Thread 2
1   mutex.lock();
2   x = x + 1;
3   mutex.unlock();
4                     mutex.lock();
5                     x = x + 1;
6                     mutex.unlock();

the HB arrow goes from 3 to 4, clearly separating the conflicting accesses in 2 and 5.

Notice the careful choice of words: “actual execution.” The following execution that contains a race can never happen, provided the mutex indeed guarantees mutual exclusion:

#   Thread 1          Thread 2
1   mutex.lock();
2                     mutex.lock();
3   x = x + 1;        x = x + 1;
4   mutex.unlock();
5                     mutex.unlock();

It turns out that the selection of possible executions plays an important role in the definition of a data race. In every memory model I know of, only sequentially consistent executions are tried in testing for data races. Notice that non-sequentially-consistent executions may actually happen, but they do not enter the data-race test.

In fact, most languages try to provide the so called DRF (Data Race Free) guarantee, which states that all executions of data-race-free programs are sequentially consistent. Don’t be alarmed by the apparent circularity of the argument: you start with sequentially consistent executions to prove data-race freedom and, if you don’t find any data races, you conclude that all executions are sequentially consistent. But if you do find a data race this way, then you know that non-sequentially-consistent executions are also possible.

DRF Guarantee

DRF guarantee. If there are no data races for sequentially consistent executions, there are no non-sequentially consistent executions. But if there are data races for sequentially consistent executions, the non-sequentially consistent executions are possible.

As you can see, in order to define a data race you have to precisely define what you mean by “simultaneous,” or by “synchronization,” and you have to specify to which executions your definition may be applied.

The Java Memory Model

In Java, besides traditional mutexes that are accessed through “synchronized” methods, there is another synchronization device called a volatile variable. Any access to a volatile variable is considered a synchronization action. You can draw happens-before arrows not only between consecutive unlocks and locks of the same object, but also between consecutive accesses to a volatile variable. With this extension in mind, Java offers the traditional DRF guarantee. The semantics of data-race-free programs is well defined in terms of sequential consistency, thus making every Java programmer happy.

But Java didn’t stop there; it also attempted to provide at least some modicum of semantics for programs with data races. The idea is noble–as long as programmers are human, they will write buggy programs. It’s easy to proclaim that any program with data races exhibits undefined behavior, but if this undefined behavior results in serious security loopholes, people get really nervous. So what the Java memory model guarantees on top of DRF is that the undefined behavior resulting from data races cannot lead to out-of-thin-air values appearing in your program (for instance, security credentials for an intruder).

It is now widely recognized that this attempt to define the semantics of data races has failed, and the Java memory model is broken (I’m citing Hans Boehm here).

The C++ Memory Model

Why is it so important to have a good definition of a data race? Is it because of the DRF guarantee? That seems to be the motivation behind the Java memory model. The absence of data races defines a subset of programs that are sequentially consistent and therefore have well-defined semantics. But these two properties: being sequentially consistent and having well-defined semantics are not necessarily the same. After all, Java tried (albeit unsuccessfully) to define semantics for non sequentially consistent programs.

So C++ chose a slightly different approach. The C++ memory model is based on partitioning all programs into three categories:

  1. Sequentially consistent,
  2. Non-sequentially consistent, but with defined semantics, and
  3. Incorrect programs with undefined semantics

The first category is very similar to race-free Java programs. The place of Java volatile is taken by C++11 default atomic. The word “default” is crucial here, as we’ll see in a moment. Just like in Java, the DRF guarantee holds for those programs.

It’s the second category that’s causing all the controversy. It was introduced not so much for security as for performance reasons. Sequential consistency is expensive on most multiprocessors. This is why many C++ programmers currently resort to “benign” data races, even at the risk of undefined behavior. Hans Boehm’s paper, How to miscompile programs with “benign” data races, delivered a deathblow to such approaches. He showed, example by example, how legitimate compiler optimizations may wreak havoc on programs with “benign” data races.

Fortunately, C++11 lets you relax sequential consistency in a controlled way, which combines high performance with the safety of well-defined (if complex) semantics. So the second category of C++ programs uses atomic variables with relaxed memory ordering semantics. Here’s some typical syntax taken from my previous blog post:

std::atomic<int> owner(0);
...
owner.load(std::memory_order_relaxed);

And here’s the controversial part: According to the C++ memory model, relaxed memory operations, like the above load, don’t contribute to data races, even though they are not considered synchronization actions. Remember one of the versions of the definition of a data race: Conflicting actions without intervening synchronization? That definition doesn’t work any more.

The C++ Standard decided that only conflicts for which there is no defined semantics are called data races.

Notice that some forms of relaxed atomics may introduce synchronization. For instance, a write access with memory_order_release “happens before” another access with memory_order_acquire, if the latter follows the former in a particular execution (but not if they are reversed!).
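A minimal sketch of that pairing: the acquire load that observes the release store creates the happens-before edge, so the plain write to the payload is guaranteed to be visible; if the load happens to run first and sees false, no such edge exists.

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;                                      // ordinary, non-atomic data

void Writer() {
   payload = 42;                                      // A
   ready.store(true, std::memory_order_release);      // B
}

void Reader() {
   if (ready.load(std::memory_order_acquire)) {       // C: observes B...
      int v = payload;                                // D: ...so A happens-before D, and v == 42
      (void)v;
   }
}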

Conclusion

What does it all mean for the C++11 programmer? It means that there no longer is an excuse for data races. If you need benign data races for performance, rewrite your code using weak atomics. Weak atomics give you the same kind of performance as benign data races but they have well defined semantics. Traditional “benign” races are likely to be broken by optimizing compilers or on tricky architectures. But if you use weak atomics, the compiler will apply whatever means necessary to enforce the correct semantics, and your program will always execute correctly. It will even naturally align atomic variables to avoid torn reads and writes.

What’s more, since C++11 has well defined memory semantics, compiler writers are no longer forced to be conservative with their optimizations. If the programmer doesn’t specifically mark shared variables as atomic, the compiler is free to optimize code as if it were single-threaded. So all those clever tricks with benign data races are no longer guaranteed to work, even on relatively simple architectures, like the x86. For instance, the compiler is free to use your lossy counter or a binary flag for its own temporary storage, as long as it restores it later. If other threads access those variables through racy code, they might see arbitrary values as part of the “undefined behavior.” You have been warned!

I just came back from the HotPar 2011 workshop in Berkeley. It was a very intense conference: presentations ran all day long and there were discussion assignments (complete with written reports) at lunch. Not much time for blogging or even tweeting.

What struck me most was the dominance of hybrid architectures– all the way from hardware to programming models. Questions ranged from, “How to best take advantage of various architectures?” to “How to unburden the programmer from making low level optimizations?” These are hard problems: there were many interesting proposals but no real breakthroughs. In the ideal world the programmer would just write a sequential algorithm (possibly with some parallelization hints) and the system would take care of parallelizing it appropriately for a particular architecture. As it is now, if the programmer doesn’t hand tune the algorithm to the architecture, the program won’t take full advantage of the computing capabilities of hardware. The details of the architecture leak into the programming model making it both hard to write and maintain as well as hard to port between different systems.

One area where there was some convergence of ideas was the way fragmented memory spaces were presented to the programmer. GPUs and hybrid multicores tend to have separate address spaces for each core, as in NUMA (Non-Uniform Memory Access) architectures. A component GPU performs its calculations using its own local memory. It may exchange data with other cores, or access global memory, but these are separate actions. In a way, a hybrid multicore resembles a cluster of computers connected through a network. Writing programs for such systems requires a lot of effort that is related more to the mechanics of communicating and moving data between cores and less to the problem domain. Hence the popularity of the idea of exposing to the programmer a uniform address space that combines main memory and local on-chip memories. That doesn’t mean that all areas of that global address space must be treated equally–that could nullify many of the performance advantages of hybrid chips. So, at some level, the programmer should be able to specify the distribution of data structures between cores. These are not new ideas. In distributed computing programmers have been using PGAS (Partitioned Global Address Space) for some time, and languages like Cray’s Chapel provide clean ways of separating algorithms from data partitioning (see my blog about HPCS). I talked to several speakers at HotPar, and surprisingly few of them were aware of the Chapel effort. And, as far as I know, there was nobody from Cray at the conference.

Of particular interest to me was the presentation by Hans Boehm, “How to Miscompile Programs with ‘Benign’ Data Races.” I should confess now that after publishing my blog post about benign data races I got in contact with Hans and he gave me a peek at this paper. Essentially, things are as bad as, or even worse than, what Andy and I described in this blog. In particular, Hans gave an example that showed how a redundant racy write (a store of the same value twice, from two different threads) may wreak havoc on your program. Read the paper for the gory details.

Bibliography

All papers presented at HotPar are now available online at the conference web site.

I’ve been working on a new malloc implementation for Jinx recently. The last time a developer at Corensic wrote a version of malloc(), I asked him why he didn’t use a great off-the-shelf malloc like tcmalloc. Now it’s my turn, and I also wrote my own. I guess you can accuse me of failing to learn from my mistakes; I’m certainly a hypocrite!

In my defense, a few things about our hypervisor make using someone else’s malloc a pain. First, we’re effectively an embedded system working inside a small chunk of memory allocated in a device driver. The hypervisor works in its own virtual address space, but we never modify that address space. We do some large allocations along the way, but we also have some long-lived smaller memory allocations, so fragmentation is a concern. We code our hypervisor in C. In addition to these constraints, we have one opportunity: we have no pre-emption in our threading package.

I decided to create a pretty simple global allocator, but with the addition of per-physical-processor freelists. Since there’s no pre-emption in our system, we can put freelists in local storage on physical processors, and access the freelists without a lock, without fear of migration.

An example of the global heap data structure follows:

[Figure: An example Jinx heap data structure.]

The heap itself is a linked list of regions from HEAD to TAIL. Each element on the linked list is either in-use or free, or is the head or tail of the list. The heap linked list must never contain two adjacent free regions (they are always coalesced into one free region). In addition, there are freelist heads for powers of two. For a given freelist head with size S, every item of size P on that freelist obeys S <= P < S * 2.  At the head of every region, there’s a data structure that looks like this:

struct region_header {
    uint64 alloc_state : 2;                   /* one of the ALLOC_STATE_* values below */
    uint64 bytes_to_previous_header : 31;     /* byte distance to the previous region's header */
    uint64 bytes_to_next_header : 31;         /* byte distance to the next region's header */
    LIST_ENTRY(region_header) freelist;       /* doubly-linked freelist linkage */
};

That LIST_ENTRY is a FreeBSD macro that represents doubly-linked-list linkage.  alloc_state has one of these values:

enum { ALLOC_STATE_INVALID, ALLOC_STATE_FREE, ALLOC_STATE_ALLOCED,
       ALLOC_STATE_SENTINEL };

ALLOC_STATE_SENTINEL is used for HEAD and TAIL.
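As an aside, the power-of-two invariant above means the freelist that should hold a region can be computed directly from its size. A minimal sketch of such a helper (hypothetical; not the actual Jinx code):

#include <stddef.h>

/* Hypothetical helper: returns the index i of the freelist whose size class
 * S = 1 << i satisfies S <= size < 2 * S, i.e. floor(log2(size)). */
static int freelist_index(size_t size) {
    int i = 0;
    while ((size >>= 1) != 0)
        i++;
    return i;
}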

To free an item into the heap, something like this occurs:

free_item(item, heap) {
    item->alloc_state = ALLOC_STATE_FREE;
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
}

This preserves the heap invariant that there exist no two adjacent free regions.

We use one giant lock for all items in the heap, relying on per-processor freelists to provide concurrency.  Those freelists reuse the freelist field of the header structure for linkage.  Those fields are protected by the global heap lock only as long as alloc_state == ALLOC_STATE_FREE.
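To give a feel for why that works, here is a sketch of a lock-free allocation fast path. Everything here is hypothetical (the per_cpu structure, current_cpu(), NUM_SIZE_CLASSES, and the freelist_index helper sketched earlier); it is not the actual Jinx code, just an illustration of the design: with no pre-emption there is no migration, so the current processor’s freelists can be popped without taking the global heap lock.

#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical per-processor storage -- not the actual Jinx code. */
enum { NUM_SIZE_CLASSES = 64 };
struct per_cpu {
    LIST_HEAD(, region_header) freelists[NUM_SIZE_CLASSES];
};
struct per_cpu *current_cpu(void);   /* returns this processor's local storage */

struct region_header *alloc_fast_path(size_t size) {
    struct per_cpu *cpu = current_cpu();
    struct region_header *r =
        LIST_FIRST(&cpu->freelists[freelist_index(size)]);
    if (r != NULL)
        LIST_REMOVE(r, freelist);    /* no lock needed: no pre-emption, no migration */
    return r;
}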

When I attempted to make this heap thread-safe, though, I wrote this:

free_item(item, heap) {
    item->alloc_state = ALLOC_STATE_FREE;
    lock();
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}

A bug in this code, like many concurrency errors, resulted in intermittent, unusual, crashes and corruption.  It was rare enough that I almost checked it in, but was saved by one of the many unit tests we run prior to commit crashing once.

Now, in retrospect, setting the alloc_state field outside of the lock was a pretty obvious bug.  Imagine that a manipulator of our left-hand neighbor, holding the lock, observed that our region had become free, and coalesced its region with ours. That would be catastrophic, as we’d end up with a pointer to a region that no longer was!  Because regions consult their neighbors’ allocation status during coalescing, that field must be protected by that lock.
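The fix is simply to move that store under the lock, so that a neighbor holding the lock can never observe a half-freed region. In the same pseudocode style as above, the corrected version looks something like this:

free_item(item, heap) {
    lock();
    item->alloc_state = ALLOC_STATE_FREE;   // now protected by the global heap lock
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}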

However, there’s a more subtle bug lurking here. Consider a different structure definition:

struct region_header {
    uint64 available: 2;
    uint64 alloc_state : 2;
    uint64 bytes_to_previous_header : 30;
    uint64 bytes_to_next_header : 30;
    LIST_ENTRY(region_header) freelist;
};

Let’s say that the “available” field is available for use by the user of the malloc interface.  Now we have a problem with non-atomicity of modification. That is, even though that field is owned by the owner of the memory block, not by the holder of the lock, it cannot be modified using normal operations.

void free_item(item, heap) {
    item->available = 2;
    lock();
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}

The first part of that code, compiled into x86-64 instructions, might look like

and $~3, (%rax)
or $2, (%rax)

As covered in a recent blog post about non-serializable single instructions, variants of these instructions without the LOCK prefix are not atomic, so there’s ample opportunity here to clobber ongoing modifications to bitfields that share our storage.  My coworker David points out that it’s more likely to get compiled as this:

mov (%rax), %rdx
and $~3, %rdx
or $2, %rdx
mov %rdx, (%rax)

Either way, this is a non-atomic update!
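One escape hatch is to take the global lock around that write as well, even though the field logically belongs to the block’s owner (a sketch in the same pseudocode style as above). The deeper problem, discussed next, is that you need to know the structure’s exact layout to realize this is necessary at all:

void free_item(item, heap) {
    lock();
    item->available = 2;     // shares a 64-bit word with lock-protected bit-fields,
                             // so this read-modify-write must happen under the lock
    item->alloc_state = ALLOC_STATE_FREE;
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}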

This is yet another occasion where local reasoning about code is foiled by concurrency.  Unlike in single-threaded code, you have to know the exact structure you’re dealing with to know whether modifying available is legal. In addition, you need to know what sort of atomicity your compiler or language guarantees when you’re accessing variables. For example, would this be OK?

struct region_header {
    uint8 available;
    uint8 alloc_state;
    uint16 bytes_to_previous_header;
    uint16 bytes_to_next_header;
    LIST_ENTRY(region_header) freelist;
};

When I modify “available”, how do I know that “bytes_to_previous_header” doesn’t take a read-modify-write trip with me?  Can I use that field without acquiring the global heap lock?  I sent this question out to the rest of our development team, and our UW CSE faculty:  in what languages is a byte read and write guaranteed to be isolated from adjacent fields?  How about a 16 bit number?  How about numbers that are the natural register width?  How about larger?

Consensus was that all languages on all commonly used architectures guarantee that if one reads or writes a byte, adjacent fields don’t go along for the ride, and that this is true, outside of bitfields, for all types up to the register size of a CPU.  Programs would clearly be too hard to reason about were this not true.

Bartosz illustrated this to me for C++ with some language from the C++ spec:

(1.7.3) A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. […] Two threads of execution (1.10) can update and access separate memory locations without interfering with each other.
(1.7.4) [ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of non-zero width. —end note ]
(1.7.5) [ Example: A structure declared as

  struct {
    char a;
    int b:5,
    c:11,
    :0,
    d:8;
    struct {int ee:8;} e;
  }

contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]
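To make the note concrete, here is a small illustration using C++11 threads (my own sketch, not part of the spec): the two threads below touch only separate memory locations, so there is no data race. If one of them wrote b while the other wrote c, there would be.

#include <thread>

struct S {
    char a;
    int b : 5, c : 11, : 0, d : 8;
    struct { int ee : 8; } e;
};

int main() {
    S s{};
    std::thread t1([&s] { s.a = 1; });              // 'a' is its own memory location
    std::thread t2([&s] { s.d = 2; s.e.ee = 3; });  // so are 'd' and 'e.ee'
    t1.join();
    t2.join();
}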

I would be delighted to hear about it if people have examples from computers, compilers, and languages, old and new, where this is not true.

One last note on this topic:  In the FreeBSD kernel, the developers have a good habit of documenting what lock protects which field of a structure, via a locking guide.  In their world, that structure might look like this:

/*
 * Locking guide:
 * L:  this field is modified only under the global malloc lock.
 * L/F: When alloc_state == ALLOC_STATE_FREE, L. Else, unlocked field.
 */
struct region_header {
    uint64 alloc_state : 2;                   /* L/F */
    uint64 bytes_to_previous_header : 31;     /* L */
    uint64 bytes_to_next_header : 31;         /* L */
    LIST_ENTRY(region_header) freelist;       /* L/F */
};

In this world, the author of the locking guide would be reprimanded for trying to apply locking rule L/F to the field alloc_state or the hypothetical available field, while other members of the same bitfield had locking rule L.  Any time you document who locks what in a structure, the granularity may not be finer than what your compiler and your architecture guarantee, assuming you’re using non-atomic operations to modify the fields!  This is a great practice that has served Corensic well in the development of its hypervisor.  My non-adherence to this practice cost me a few hours in debugging!

So, in summary:

  • The combination of bitfields + concurrency is dangerous, because fields of a structure are implicitly intertwined with adjacent fields.
  • Those problems could be mitigated somewhat by wrapping all bitfields in their own structure (a sketch follows this list).  If you don’t do this (as we don’t), then you are denied local reasoning about fields of structures.
  • For these reasons, converting code into use of bitfields due to space pressure is also an operation that cannot be performed with only local reasoning.  When you change a field of structure to a bitfield, you have to look at uses of that field to ensure that you don’t violate interference assumptions in the code.
  • Use of a locking guide in your C structures will alert you to many of these problems.
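Here is a sketch of that second mitigation (not the layout we actually use): the lock-protected bit-fields get their own nested struct, and a field like available becomes an ordinary byte outside it, so it no longer shares a memory location with them.

struct region_header {
    uint8 available;                              /* owned by the block's owner, no lock */
    struct {
        uint64 alloc_state : 2;
        uint64 bytes_to_previous_header : 31;
        uint64 bytes_to_next_header : 31;
    } guarded;                                    /* every bit-field here: global heap lock */
    LIST_ENTRY(region_header) freelist;
};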

In a previous post to this blog, “Dealing with Benign Data Races the C++ Way”, Bartosz examined an apparently benign data race involving a recursively acquirable mutex. In his analysis, he suggested a highly unlikely, but legal, transformation that a compiler could apply to make the race in the mutex less-than-benign. Bartosz’s example transformation involved temporarily writing an arbitrary value (say 42) to a variable (the owner field) that was always guarded by a lock, except for a racy read during acquisition of the recursive mutex.

In his comment on Bartosz’s post, Hans Boehm described a more likely transformation that is very much like one that happened to me, once:

I think a more likely breaking “transformation” here is that the owner field may be misaligned or, for an embedded processor, larger than the atomically updateable word size. In either case, updates could become visible to observers in two sections. This means the observed value could be different from any value actually stored. In unlikely cases, it could match the thread id of the querying thread.

Compilers generally make you jump through hoops if you want to misalign fields inside structures, but it can be done. Through a sequence of terrible accidents, my coworkers and I once produced a piece of code that, while more obscure, was equivalent to the following:

#pragma pack (push, 1)  // May cause fields of RecursiveMutex
                        // not to be naturally aligned.

class RecursiveMutex {
public:
    RecursiveMutex() : owner(0), count(0), lock() {}
    bool TryEnter();
    void Exit();

private:
    char garbage;
    int owner;
    int count;
    Lock lock;
};

bool RecursiveMutex::TryEnter() {
    if (owner == /* get thread id */) {  // Racy read on this line.
        count += 1;
        return true;
    }

    if (lock.TryEnter()) {
        owner = /* get thread id */;
        return true;
    }
    return false;
}

void RecursiveMutex::Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 0;
    lock.Exit();
}

With the compiler we were using, the #pragma pack (push, 1) declaration caused the fields of RecursiveMutex to be laid out in memory without any padding that would allow for natural field alignment. As a result, the reads and writes of the “owner” field were not guaranteed to be atomic by our compiler and architecture (AMD64). Worse still, because the x86 and AMD64 architectures provide hardware support for unaligned memory accesses, this bug could manifest itself as a non-serializable execution of just two machine instructions.

For example, suppose the assignment “owner = 0” in Exit() compiled to the following x86/AMD64 instruction:

MOV [ecx], 0

Further, suppose that the racy read of “owner” in TryEnter compiled to this instruction:

MOV eax, [ecx]

If the owner field is not naturally aligned, and the RecursiveMutex instance starts at address 0x1000, the following non-serializable interleaving of two threads is possible:

  1. Thread 1 (in Exit) stores the first 3 bytes of the constant 0 into [0x1001:0x1003].
  2. Thread 2 (in TryEnter) reads [0x1001:0x1003] into the first 3 bytes of eax.
  3. Thread 2 (in TryEnter) reads [0x1004] into the last byte of eax.
  4. Thread 1 (in Exit) stores the last byte of the constant 0 into [0x1004].
In this case, thread 2 will see a value that should never have been observable in memory: a partial read of the first thread’s write, combined with a partial read of the data present before the write. This value may be equal to the second thread’s ID. However unlikely that may be, the consequences of such a coincidence could be disastrous.

Interestingly, even though the atomic forms of the x86/AMD64 read-modify-write instructions can atomically access data that is not naturally aligned, there is no atomic form for other operations, like MOV, when working with unaligned data.

What is the lesson for a C++ programmer who, for one reason or another, wants to play with lock-free code? If you have access to a C++11 compiler, use atomics. The owner field should be defined as atomic, even if you access it through relaxed memory operations, such as load(memory_order_relaxed). An atomic variable is guaranteed to be naturally aligned. For those of us still using the obsolete compilers, make sure you’re not forcing the compiler to misalign your shared data.

Caveat

While the x86 and AMD64 architectures may treat non-atomically any unlocked memory operation that is not naturally aligned, implementations tend to only show this behavior at certain larger alignment boundaries.  For example, some processors treat all memory accesses that fall within a single 64-byte cache line atomically.  Others only do so for entries that fall within a 16-byte store-buffer entry.  This makes the unsynchronized sharing of data that is not naturally aligned extremely dangerous.  It may work on one machine, and not on another.  At Corensic, we have observed processors that split at 16-byte boundaries and at 64-byte boundaries.

Bibliography

  1. Scott Owens, Susmit Sarkar, Peter Sewell, A Better x86 Memory Model: x86-TSO
  2. Relaxed-Memory Concurrency (includes papers on non-x86 architectures)

Data races lead to undefined behavior; but how bad can they really be? In my previous post I talked about benign data races and I gave several examples taken from the Windows kernel. Those examples worked because the kernel was compiled with a specific compiler for a specific processor. But in general, if you want your code to be portable, you can’t have data races, period.

You just cannot reason about something that is specifically defined to be “undefined.” So, obviously, you cannot prove the correctness of a program that has data races. However, very few people engage in proofs of correctness. In most cases the argument goes, “I can’t see how this can fail, therefore it must be correct” (maybe not in so many words). I call this “proof by lack of imagination.” If you want to become a concurrency expert, you have to constantly stretch your imagination. So let’s do a few stretches.

One of the readers of my previous post, Duarte Nunes, posted a very interesting example of a benign data race. Here’s the code:

int owner = 0;
int count = 0;
Lock lock;

bool TryEnter() {
    if (owner == /* get thread id */) {
        count += 1;
        return true;
    }

    if (lock.TryEnter()) {
        owner = /* get thread id */;
        return true;
    }
    return false;
}

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 0;
    lock.Exit();
}

The parts executed under the lock are the whole body of ‘Exit’ (in a correct program, ‘Exit’ will always be called after the lock has been acquired), the ‘count += 1’ branch of TryEnter, and the assignment to ‘owner’ after a successful lock.TryEnter(). Notice that the variable ‘count’ is always accessed under the lock, so no data races there. However, the variable ‘owner’ may also be read outside of the lock– the comparison at the top of TryEnter. That’s the data race we are talking about.

Try to analyze this code and imagine what could go wrong. Notice that the compiler or the processor can’t be too malicious. The code still has to work correctly if the data race is removed, for instance if the racy read is put under the lock.

Here’s an attempt at the “proof” of correctness. First, Duarte observed that “The valid values for the ‘owner’ variable are zero and the id of any thread in the process.” That sort of makes sense, doesn’t it? Now, the only way the racy read can have any effect is if the value of ‘owner’ is equal to the current thread’s ID. But that’s exactly the value that could have been written only by the current thread– and under the lock.

There are two possibilities when reading ‘owner’: either we are still under that lock, in which case the read is not at all racy; or we have already released the lock. But the release of the lock happens only after the current thread zeroes ‘owner.’

Notice that this is a single-thread analysis and, within a single thread, events must be ordered (no need to discuss memory fences or acquire/release semantics at this point). A read following a write in the same thread cannot see the value that was there before the write. That would break regular single-threaded programs. Of course, other threads may have overwritten this zero with their own thread IDs, but never with the current thread ID. Or so the story goes…

Brace yourself: Here’s an example of how a compiler or the processor may “optimize” the program:

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 42;
    owner = 0;
    lock.Exit();
}

You might argue that this is insane and no compiler in its right mind would do a thing like this; but the truth is: It’s a legal program transformation. The effect of this modification is definitely not observable in the current thread. Neither is it observable by other threads in the absence of data races. Now, the unfortunate thread whose ID just happens to be 42 might observe this value and take the wrong turn. But it can only observe it through a racy read. For instance, it would never see this value if it read ‘owner’ under the lock. Moreover, it would never see it if the variable ‘owner’ were defined as ‘atomic’:

std::atomic<int> owner{0};

Stores and loads of atomic variables are, by default, sequentially consistent. Unfortunately sequential consistency, even on an x86, requires memory fences, which can be quite costly. It would definitely be overkill in our example. So here’s the trick: Tell the compiler to forgo sequential consistency on a per-read/write basis. For instance, here’s how you read an atomic variable without imposing any ordering constraints:

owner.load(memory_order_relaxed)

Such ‘relaxed’ operations will not introduce any memory fences– at least not on any processor I know about.

Here’s the version of the code that is exactly equivalent to the original, except that it’s correct and portable:

std::atomic<int> owner{0};
int count = 0;
Lock lock;

bool TryEnter() {
    if (owner.load(memory_order_relaxed) == /* get thread id */) {
        count += 1;
        return true;
    }

    if (lock.TryEnter()) {
        owner.store(/* get thread id */, memory_order_relaxed);
        return true;
    }
    return false;
}

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner.store(0, memory_order_relaxed);
    lock.Exit();
}

So what has changed? Can’t the compiler still do the same dirty trick on us, and momentarily store 42 in the ‘owner’ variable? No, it can’t! Since the variable is declared ‘atomic,’ the compiler can no longer assume that the write can’t be observed by other threads.

The new version has no data races: The (draft) Standard specifically states that ‘atomic’ variables don’t contribute to data races, even in their most relaxed form.

C++ Draft Standard, (1.10.5):
[…] “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

With those changes, I believe that our “proof” of correctness may actually be turned into a more rigorous proof using the axioms of the C++ memory model (although I’m not prepared to present one). We can have our cake (correct, portable code) and eat it too (no loss of performance). And, by the way, the same trick may be used in the case of lossy counters from my previous post.
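For instance, a racy lossy counter can be turned into something like this (a minimal sketch with hypothetical names): increments may still be lost under contention, which is the accepted trade-off for a lossy counter, but there is no data race, and no fences or LOCK-prefixed instructions are required.

#include <atomic>

std::atomic<long> eventCount{0};

void CountEvent() {
    // Relaxed load followed by relaxed store: the increment as a whole is not
    // atomic, so concurrent increments may be dropped, just as with the
    // original racy counter, but every individual access is well defined.
    eventCount.store(eventCount.load(std::memory_order_relaxed) + 1,
                     std::memory_order_relaxed);
}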

Warning: I do not recommend this style of coding, or the use of weak atomics, to anybody who is not implementing operating system kernels or concurrency libraries.

Acknowledgments

I’d like to thank Luis Ceze, Hans Boehm, and Anthony Williams for useful remarks and for verifying my assumptions about the C++ memory model.

Bibliography

  1. C++ Draft Standard

Announcement

In the effort to raise the awareness about concurrent programming, I’ll be hosting a free webinar, “The Language of Concurrency,” Tuesday May 24, 2011 at 12pm PDT. It will be a broad overview of the domain and the explanation of common terms, such as message passing, data races, atomics, sequential consistency, etc. Nothing sophisticated, but you might recommend it to your coworkers who are still learning the ropes or need a refresher. You can pre-register here, get an email reminder, and download the slides. If this is successful, I’ll start working on some more advanced topics (suggestions are welcome).

Can a data race not be a bug? In the strictest sense I would say it’s always a bug. A correct program written in a high-level language should run the same way on every processor present, past, and future. But there is no proscription, or even a convention, about what a processor should (or shouldn’t) do when it encounters a race. This is usually described in higher-level language specs by the ominous phrase: “undefined behavior.” A data race could legitimately reprogram your BIOS, wipe out your disk, and stop the processor’s fan causing a multi-core meltdown.

Data race: Multiple threads accessing the same memory location without intervening synchronization, with at least one thread performing a write.

However, if your program is only designed to run on a particular family of processor, say the x86, you might allow certain types of data races for the sake of performance. And as your program matures, i.e., goes through many cycles of testing and debugging, the proportion of buggy races to benign races keeps decreasing. This becomes a real problem if you are using a data-race detection tool that cannot distinguish between the two. You get swamped by false positives.

Microsoft Research encountered and dealt with this problem when running their race detector called DataCollider on the Windows kernel (see Bibliography). Their program found 25 actual bugs, and almost an order of magnitude more benign data races. I’ll summarize their methodology and discuss their findings about benign data races.

Data Races in Windows Kernel

The idea of the program is very simple. Put a hardware breakpoint on a shared memory access and wait for one of the threads to stumble upon it. This is a code breakpoint, which is triggered when the particular code location is executed. The x86 also supports data breakpoints, which are triggered when the program accesses a specific memory location. So when a thread hits the code breakpoint, DataCollider installs a data breakpoint on the location the thread was just accessing. It then stalls the current thread and lets all other threads run. If any one of them hits the data breakpoint, it’s a race (as long as one of the accesses is a write). Consider this: If there were any synchronization between the two accesses, the second thread would have been blocked from accessing that location. Since it wasn’t, we have a classic data race.
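Paraphrased as code, the core of the scheme looks roughly like this. This is a C++-flavored sketch of the description above; none of the names are DataCollider’s actual interface, and the breakpoint helpers are left as hypothetical declarations.

using Address = void*;

void SetDataBreakpoint(Address location);    // hypothetical: arm a hardware data breakpoint
void ClearDataBreakpoint(Address location);  // hypothetical: disarm it
bool DataBreakpointFired();                  // hypothetical: did another thread trip it?
bool TrappedAccessWasWrite();                // hypothetical: was that access a write?
void StallCurrentThread();                   // hypothetical: short delay; other threads keep running
void ReportDataRace(Address location);       // hypothetical: record the race

// Called when a thread trips one of the randomly placed code breakpoints,
// right after accessing 'location'; isWrite says whether that access was a write.
void OnCodeBreakpoint(Address location, bool isWrite) {
    SetDataBreakpoint(location);
    StallCurrentThread();
    ClearDataBreakpoint(location);
    if (DataBreakpointFired() && (isWrite || TrappedAccessWasWrite()))
        ReportDataRace(location);   // conflicting accesses, no intervening synchronization
}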

Notice that this method might not catch all data races, but it doesn’t produce false positives. Except, of course, when the race is considered benign.

There are other interesting details of the algorithm. One is the choice of code locations for installing breakpoints. DataCollider first analyzes the program’s assembly code to create a pool of memory accesses. It discards all thread-local accesses and explicitly synchronized instructions (for instance, the ones with the LOCK prefix). It then randomly picks locations for breakpoints from this pool. Notice that rarely executed paths are as likely to be sampled as the frequently executed ones. This is important because data races often hide in less frequented places.

Pruning Benign Races

90% of the data races caught by DataCollider in the Windows kernel were benign. For several reasons it’s hard to say how general this result is. First, the kernel had already been tested and debugged for some time, so many low-hanging concurrency bugs had already been picked. Second, operating system kernels are highly optimized for a particular processor and might use all kinds of tricks to improve performance. Finally, kernels often use unusual synchronization strategies. Still, it’s interesting to see what shape benign data races take.

It turns out that half of the false positives came from lossy counters. There are many places where statistics are gathered: counting certain kinds of events, either for reporting or for performance enhancements. In those situations losing a few increments is of no relevance. However, not all counters are lossy and, for instance, a data race in reference counting is a serious bug. DataCollider uses a simple heuristic to detect lossy counters–they are the ones that are always incremented. A reference counter, on the other hand, is as often incremented as decremented.

Another benign race happens when one thread reads a particular bit in a bitfield while another thread updates another bit. A bit update is a read-modify-write (RMW) sequence: The thread reads the previous value of the bitfield, modifies one bit, and writes the whole bitfield back. Other bits are overwritten in the process too, but their new values are the same as the old values. A read from another thread of any of the non-changed bits does not interfere with the write, at least not on the x86. Of course if yet another thread modified one of those bits, it would be a real bug, and it would be caught separately. The pruning of this type of race requires analysis of surrounding code (looking for the masking of other bits).

The Windows kernel also has some special variables that are racy by design–current time is one such example. DataCollider has these locations hard-coded and automatically prunes them away.

There are benign races that are hard to prune automatically, and those are left for manual pruning (in fact, DataCollider reports all races, it just de-emphasizes the ones it considers benign). One of them is the double-checked locking pattern (DCLP), where a thread makes a non-synchronized read to be later re-confirmed under the lock. This pattern happens to work on the x86, although it definitely isn’t portable.
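For reference, here is a sketch of the pattern (hypothetical names, written in C++ rather than kernel C): the first read of the shared pointer is deliberately unsynchronized and is re-confirmed under the lock. It happens to work on the x86, but the racy read is still a data race and therefore not portable.

#include <mutex>

struct Widget { /* ... */ };

Widget* instance = nullptr;
std::mutex initMutex;

Widget* GetWidget() {
    if (instance == nullptr) {                      // unsynchronized (racy) read
        std::lock_guard<std::mutex> guard(initMutex);
        if (instance == nullptr)                    // re-checked under the lock
            instance = new Widget();
    }
    return instance;
}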

Finally, there is the interesting case of idempotent writes— two racing writes that happen to write the same value to the same location. Even though such scenarios are easy to prune, the implementers of DataCollider decided not to prune them because more often than not they led to the uncovering of concurrency bugs. Below is a table that summarizes various cases.

Benign race                          Differentiate from                   Pruned?
Lossy counter                        Reference counting                   Yes
Read and write of different bits     Read and write of the whole word     Yes
Deliberately racy variables          -                                    Yes
DCLP                                 -                                    No
Idempotent writes                    -                                    No

Conclusion

In the ideal world there would be no data races. But a concurrency bug detector must take into account the existence of benign data races. In the early stages of product testing the majority of detected races are real bugs. It’s only when chasing the most elusive of concurrency bugs that it becomes important to weed out benign races. But it’s the elusive ones that bite the hardest.

Bibliography

  1. John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk, Effective Data-Race Detection for the Kernel