Recently we’ve had yet another discussion at work about the definition of data race. You’d think that being in the business of detecting data races, we should know what a data race is. Unfortunately, it’s not so simple. Defining a data race is one of the harder problems faced by the designers of memory models (and concurrency bug checkers). Java had it figured out some time ago, and now C++11 has its own definition. But our product, Jinx, is language agnostic–it catches data races at the processor level.

The problem is that there is no one-to-one mapping between data races as defined by a higher-level language and the object code produced by the compiler. Consider a lock-free algorithm written in C++ using weak atomics (e.g., using memory_order_relaxed operations on atomic variables). When compiled for the x86 processor, the assembly instructions might look identical to those generated without atomics. But from the point of view of C++, the first version is race-free and the second is full of races. In this case the role of atomics is reduced to being nothing more than a programmer’s annotation that says, “I know what I’m doing.” (And also, “This code is portable.”)
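
To make this concrete, here’s a minimal sketch (my own illustration, not code from the post or from Jinx) of two functions that can compile to the same plain MOV on the x86, even though only the atomic one is race-free in the C++ memory model:

#include <atomic>

std::atomic<int> flag_atomic{0};
int flag_plain = 0;

void publish_atomic() {
    // Typically compiles to a plain MOV on x86, yet it is race-free
    // according to the C++ memory model.
    flag_atomic.store(1, std::memory_order_relaxed);
}

void publish_plain() {
    // May produce the identical MOV, but if another thread reads
    // flag_plain concurrently, this is a data race (undefined behavior).
    flag_plain = 1;
}

Since the object code can be identical, a tool that works at the processor level has no way to tell the two apart.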

On the other hand, neither is there a one-to-one correspondence between data races, as defined by a higher-level language, and concurrency bugs. Sometimes looking at how the code is actually executed on a multicore processor reveals concurrency errors that don’t manifest themselves as data races at the source level. And then there are programs that mix high-level languages and inline assembly, for which high-level analysis is impossible.

The purpose of this blog post is to gain some insight into data races at the processor level. I’ll start with the description of the x86 memory model and a definition of data race. I’ll explain the relationship between data races and sequential consistency. Then I’ll show you an example of a Linux spinlock, which contains a deliberate data race. This race can be classified as a non-triangular data race. I’ll explain what a triangular data race is and why it might be important in analyzing program execution.

The x86 Memory Model

In order to understand data races at the lowest level you have to dig into a processor’s memory model. For the x86, both Intel and AMD produced documents that provide some guidance, but fall short of defining a formal model. It took a lot of research and second-guessing before academics and Intel engineers agreed on one particular model called Total Store Order (TSO).

The gist of this model is that, at least conceptually, all writes to memory from a particular core go through this core’s write buffer, which is invisible to other cores. The contents of this buffer find their way to global memory in FIFO order at some random intervals or as a result of special instructions. These special instructions are either LOCK’d instructions (the ones with the LOCK prefix) or fences (actually, only MFENCE is relevant here, because the model covers regular write-back memory, as opposed to write-combining memory or non-temporal instructions). Those instructions flush the buffer of the core that is executing them. It’s pretty surprising that such a simple model explains all the strange behaviors of the x86 processor with respect to memory loads and stores.

Total Store Order processor memory model
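
Here’s a rough sketch of that abstract machine in C++ (my own illustration with made-up names, not anyone’s formal model): each core owns a FIFO store buffer, a load first checks its own core’s buffer (intra-processor forwarding), and a fence or LOCK’d instruction drains the buffer into global memory.

#include <cstdio>
#include <deque>
#include <map>
#include <utility>

struct TsoMachine {
    std::map<long, int> memory;                            // global memory: address -> value
    std::map<int, std::deque<std::pair<long, int>>> buf;   // per-core FIFO of (address, value)

    void store(int core, long addr, int val) {
        buf[core].push_back({addr, val});                  // buffered; invisible to other cores
    }

    int load(int core, long addr) {
        auto& b = buf[core];
        for (auto it = b.rbegin(); it != b.rend(); ++it)   // forward the youngest buffered store
            if (it->first == addr) return it->second;
        return memory[addr];                               // otherwise read global memory (0 if untouched)
    }

    void flush(int core) {                                 // MFENCE or a LOCK'd instruction
        for (auto& [addr, val] : buf[core]) memory[addr] = val;  // drained in FIFO order
        buf[core].clear();
    }
};

int main() {
    TsoMachine m;
    m.store(0, 0x10, 42);                      // sits in core 0's write buffer
    std::printf("%d\n", m.load(1, 0x10));      // core 1 still sees 0
    m.flush(0);                                // fence or LOCK'd instruction on core 0
    std::printf("%d\n", m.load(1, 0x10));      // now core 1 sees 42
}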

A few more details: Several x86 instructions are modeled as a series of actions (see our earlier blog entry). In particular, a LOCK’d instruction can be split into smaller constituents:

  1. LOCK, which flushes the buffer and locks access to global memory for all cores,
  2. loads, stores, and arithmetic actions specific to a particular instruction,
  3. UNLOCK, which flushes the buffer and unlocks access to global memory.

For instance, the execution of:

LOCK XCHG [x], EAX

may be modeled by:

  1. flushing the local write buffer,
  2. acquiring the lock on the bus,
  3. loading x from memory to a temporary register,
  4. writing EAX to x (through the local write buffer),
  5. copying the value from the temporary to EAX,
  6. flushing the write buffer (thus forcing the new value of x to be written to global memory), and
  7. releasing the bus lock.

Keep in mind that this is an idealized model, not necessarily corresponding to actual architecture or microcode.

x86 Data Races

Let’s start with the simplest definition of a data race and build things up from there. Imagine that you are single-stepping through a program that has multiple threads running on multiple cores. At each step you pick one core and execute the next instruction in its thread. For complex (but not LOCK’d) instructions, like INC [x], you might execute just one of its actions, e.g., the load of [x], at a time. This kind of step-by-step execution is called sequentially consistent, as long as all stores go directly to global memory.

A data race occurs if you can pick two actions accessing the same memory location on two separate cores to be executed next to each other, and at least one of them is a store. Notice that if these two actions can be made adjacent in a global execution order, they can also be executed in the reverse order. Now, if both actions were just memory loads, the order wouldn’t matter. This is the reflection of a more general principle that you can never have data races on immutable data. But if one of the accesses is a store, the order does matter.
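
As a toy illustration (my own sketch, and definitely not Jinx’s actual algorithm, which searches over possible interleavings rather than inspecting a single given order), here is that definition spelled out in code: two actions conflict if they come from different cores, touch the same address, and at least one is a store; a race is flagged when a conflicting pair ends up adjacent in the global execution order.

#include <cstddef>
#include <cstdio>
#include <vector>

enum class Op { Load, Store };
struct Action { int core; Op op; long addr; };

// Two actions conflict if they come from different cores, touch the same
// address, and at least one of them is a store.
bool conflicting(const Action& a, const Action& b) {
    return a.core != b.core && a.addr == b.addr &&
           (a.op == Op::Store || b.op == Op::Store);
}

int main() {
    // Core 0 stores to address 0x10 immediately before core 1 loads it.
    std::vector<Action> trace = {
        {0, Op::Store, 0x10},
        {1, Op::Load,  0x10},
    };
    for (std::size_t i = 0; i + 1 < trace.size(); ++i)
        if (conflicting(trace[i], trace[i + 1]))
            std::printf("data race at trace position %zu\n", i);
}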

This definition of data race is the foundation of dynamic race detectors, including Jinx. Jinx virtually (that is, using virtual machine technology) re-runs parts of your program in a sequentially consistent manner, rearranging the threads in order to, among other things, bring shared memory accesses next to each other. If the program is properly synchronized, this should be impossible. For instance, one of the threads will be blocked waiting for a lock (this time I’m talking about higher-level locks, or critical sections) while the other is accessing a shared variable under the same lock.

Of course, the next question is about the locks themselves: They are implemented using the same instructions as the rest of the code, and use shared variables, so how come they don’t generate their own data races? Some race detectors recognize a few predefined patterns of instructions that correspond to locks in a particular language, library, or operating system. Again, Jinx tries to be agnostic.

Maybe unsurprisingly, higher-level locks are implemented using LOCK’d instructions. So if we narrow the definition of data races to exclude conflicts between LOCK’d instructions, we should be fine. But are we allowed to play with definitions? How do we decide what is and what isn’t a data race? Are there some general principles to give us guidance? There is one I mentioned before called…

Sequential Consistency

Ideally, we would like to define data races so that they are always concurrency bugs. This isn’t easy, because we don’t have a definition of a bug. In a higher-level language, like C++, it’s enough to say that a data race leads to undefined behavior. But at a processor level, everything is defined. A processor won’t suddenly stop and say, “I don’t know what to do with this program.”

However, higher level languages give us a hint in terms of the DRF guarantee. DRF stands for Data-Race Freedom, and the guarantee is that if your program is free of data races, it is sequentially consistent (SC). Mind you, an SC program might still be buggy, and even have concurrency bugs, but at least we know how to reason about such programs. We run them in our minds, one action at a time, switching between threads at will, assuming that all writes go directly to main memory and are immediately visible to other threads. We know that this is not what really happens, but if a program is DRF, this is how it behaves.

Not all languages offer the DRF guarantee; C++ supports weak atomics that break SC without introducing data races. Nevertheless, DRF is a good goal.
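
Here’s what that looks like in practice (a minimal sketch of my own, not taken from the post): the classic store-buffering litmus test written with relaxed atomics contains no data race in the C++ sense, yet the outcome in which both threads read zero is permitted, and no sequentially consistent interleaving of the four operations can produce it.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // r1 == 0 && r2 == 0 is a legal outcome here, even though no
    // sequentially consistent interleaving of the four operations allows it.
    std::printf("r1 = %d, r2 = %d\n", r1, r2);
}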

What does it all mean for the x86? The definition of SC still applies at the assembly level, although we have to allow for splitting complex instructions. For instance, INC [x] could be split into three actions: a load, an increment, and a store. If this instruction doesn’t have a LOCK prefix, another thread is free to execute its own actions in between the actions of the current instruction.

A TSO processor, like the x86, would be almost SC, if it weren’t for those pesky write buffers. Still, many memory-access conflicts on an x86 don’t break sequential consistency. For instance, because of the FIFO nature of the write buffers, two conflicting stores don’t break SC, and therefore are not considered a data race. Think of it this way: Inverting two adjacent stores from different threads doesn’t change anything because both stores go to separate write buffers. The external visibility of those stores depends on the order in which the buffers are discharged into global memory, and they are discharged in a sequentially consistent way.

So the only races on the x86 are those between loads and stores, except when both are LOCK’d (in which case they flush their buffers). That’s a good starting point, but a practical question remains: If we report all of those events as data races, how many of them will be considered false positives in real life? Unfortunately, the answer is: too many. It turns out that an efficient implementation of higher-level locks often contains deliberate races. It’s true that the acquisition of a lock usually starts with a LOCK’d instruction, often a CMPXCHG (also known as CAS), but the corresponding release might be implemented using an unsynchronized store. This is why the Jinx definition of a data race excludes conflicts in which even one of the instructions is LOCK’d.
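
As an illustration of that pattern, here’s a hedged sketch (my own toy code, not Jinx’s or Linux’s): the acquire side uses a compare-and-swap, which on the x86 compiles to a LOCK’d CMPXCHG, while the release side is an ordinary-looking store, a plain MOV, and it’s precisely that store which a naive processor-level detector would flag.

#include <atomic>

struct TinySpinlock {
    std::atomic<int> locked{0};

    void acquire() {
        int expected = 0;
        // Typically compiles to a LOCK CMPXCHG loop on x86.
        while (!locked.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire)) {
            expected = 0;   // compare_exchange_weak overwrites expected on failure
        }
    }

    void release() {
        // Typically compiles to a plain MOV on x86 -- an "unsynchronized"
        // store from the processor's point of view.
        locked.store(0, std::memory_order_release);
    }
};

Note that this toy version spins by retrying the LOCK’d compare-and-swap; the Linux ticket lock analyzed below spins on a plain load instead, and it’s that unsynchronized load that shows up in the race report.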

Do we miss some legitimate data races because of that? Perhaps, but we haven’t seen such examples in practice yet. Interestingly, even this narrowed definition of data races flags some legitimate lock-free algorithms, in particular the Linux ultra-fast implementation of a spinlock.

Spinlock

Here’s an actual report produced by Jinx on Linux (Jinx can be used to debug the kernel):

Event: Data Race (3 times)
* Stack A
#0 0xffffffff81036d44 in __ticket_spin_lock at spinlock.h:65
#1 0xffffffff8159e7ee in arch_spin_lock at paravirt.h:744
#2 0xffffffff810909cd in futex_wake at futex.c:891
#3 0xffffffff81093288 in do_futex at futex.c:2602
#4 0xffffffff810933fb in sys_futex at futex.c:2636
* Stack B
#0 0xffffffff81036d74 in __ticket_spin_unlock at spinlock.h:101
#1 0xffffffff81090a58 in arch_spin_unlock at paravirt.h:760
#2 0xffffffff81093288 in do_futex at futex.c:2602
#3 0xffffffff810933fb in sys_futex at futex.c:2636

Two threads, A and B, ran into a race in spinlock.h. One access occurred at line 65 and the other at line 101 of that file. The two stack traces show that the caller, in both cases, was a futex, or “fast userspace mutex,” a Linux implementation of a thin lock.

Let’s have a look at spinlock.h. Line 65 is inside the function __ticket_spin_lock, which contains a bunch of inline assembly that I’m not going to analyze here (although I will analyze similar code shortly):

 61 static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 62 {
 63     short inc = 0x0100;
 64
 65     asm volatile (
 66         LOCK_PREFIX "xaddw %w0, %1\n"
 67         "1:\t"
 68         "cmpb %h0, %b0\n\t"
 69         "je 2f\n\t"
 70         "rep ; nop\n\t"
 71         "movb %1, %b0\n\t"
 72         /* don't need lfence here, because loads are in-order */
 73         "jmp 1b\n"
 74         "2:"
 75         : "+Q" (inc), "+m" (lock->slock)
 76         :
 77         : "memory", "cc");
 78 }

The thing to notice is that this is the acquisition of a (ticketed) spinlock, the first instruction of which is indeed LOCK’d. Since Jinx doesn’t flag races with LOCK’d instructions, it must be the next instruction that accesses memory, movb, which is involved in the race. It reads a byte-sized value from memory without using a LOCK.

The other racing line, 101, is inside __ticket_spin_unlock:

 99 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
100 {
101     asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
102          : "+m" (lock->slock)
103          :
104          : "memory", "cc");
105 }

This is the release part of the ticketed spinlock, and it contains an increment instruction that accesses memory (the same lock->slock variable as in the previous function). The macro UNLOCK_LOCK_PREFIX expands to nothing on processors other than the Pentium Pro (which had a relevant erratum). So we do have an unprotected store (part of incb) that races with an unprotected load (movb in the previous function). And yet, the code is correct. But to understand that, we need to learn about a more sophisticated notion of…

Triangular Data Races

Let’s go back to the x86 TSO memory model. The cores write into their private FIFO buffers. Data moves from these buffers into main memory. At any point in time only one buffer may discharge a previous write entry into memory, so those writes end up interleaved with each other. What’s important is that all processors see those writes appearing in main memory in the same order — the Total Store Order. (To be precise, the processor doing the write will see it immediately due to intra-processor forwarding, but that doesn’t matter in our case.)

Now imagine that you have an actual execution on the x86 and you want to show that it’s equivalent to an SC execution. You could do it, for instance, by rearranging instructions so that stores are moved up to the point where they were originally discharged into global memory. In order to preserve program order on individual processors, you would also have to move up the loads (and other instructions) that appear between those stores. But if you move a load forward too much, it might see a different store than it did in the original execution, and the new SC execution will not be equivalent to the original one.

A typical situation is when stores and loads cross each other, as in this program (x and y originally zero):

P1                P2
mov [x], 1        mov [y], 1
mov eax, [y]      mov eax, [x]

If stores from both processors still sit in their respective buffers while the loads are performed, both eax registers will end up with zeros. If we try to move the stores forward to the point where the buffered writes were flushed, we’ll also have to push the loads, and they will end up reading ones instead of zeros. This is clearly a non-SC execution and the races it contains are definitely of interest to us.

Rearranging stores to create an equivalent execution with no buffering. Because program order must be preserved, the loads are rearranged too, resulting in different values being read. The SC execution is not equivalent to the original one. (Time flows from left to right.)

It turns out that all violations of SC on the x86 have one thing in common: a triangular data race. Here is the triangular pattern (we have two of those in our example):

P1                P2                Comment
mov [x], 1                          Preceding store
                                    No lock or fence
mov eax, [y]      mov [y], 1        The race between a load and a store

More generally, a triangular data race occurs between a load action on one processor and a store action on another if the load is preceded by some other store. In the case above we have a race on y, and what makes it triangular is the odd store to x. The key is that this store may still sit in the write buffer when the racing load is made. If, on the other hand, there is any fence or LOCK/UNLOCK between the odd store and the racing load, we are fine; the race is not triangular.
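
To restate the definition in code, here is a rough sketch (my own reading of the definition above, not Owens’s formalization and not Jinx’s implementation): walking one core’s instruction stream, a racing load counts as triangular if some earlier store by that core has not yet been separated from it by a fence or a LOCK’d instruction.

#include <cstddef>
#include <cstdio>
#include <vector>

// Instruction kinds relevant to the check; Fence also stands in for LOCK'd
// instructions, since both flush the write buffer.
enum class Kind { Load, Store, Fence };

// True if the load at position load_idx in one core's instruction stream
// could be the load side of a triangular race: some earlier store by the
// same core is not separated from it by a fence or LOCK'd instruction.
bool load_is_triangular(const std::vector<Kind>& core_stream, std::size_t load_idx) {
    bool pending_store = false;
    for (std::size_t i = 0; i < load_idx; ++i) {
        if (core_stream[i] == Kind::Store) pending_store = true;
        if (core_stream[i] == Kind::Fence) pending_store = false;
    }
    return pending_store;
}

int main() {
    std::vector<Kind> with_fence    = {Kind::Store, Kind::Fence, Kind::Load};
    std::vector<Kind> without_fence = {Kind::Store, Kind::Load};
    std::printf("%d %d\n",
                load_is_triangular(with_fence, 2),      // 0: the fence drains the store
                load_is_triangular(without_fence, 1));  // 1: the store may still be buffered
}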

What’s amazing is that it can be proven that any x86 program that doesn’t contain triangular data races is sequentially consistent. (For a formal definition of a triangular data race and a proof of the theorem, see Owens [2].) Armed with this knowledge, let’s revisit the spinlock example.

Spinlock and Triangular Races

The Linux ticketed spinlock is an elaboration on a basic spinlock. Here’s the (more readable) Intel assembly code for that simplified spinlock. There is a variable, lck, on which to spin, and its initial value is 1, corresponding to the lock being unclaimed. The address of lck is stored in eax.

Label      Instruction        Comment
acquire:   lock dec [eax]     atomic {
                                  tmp = [lck] - 1
                                  [lck] = tmp
                                  flag = (tmp >= 0)
                                  flush local write buffer }
           jns enter          if (flag) goto enter
spin:      cmp [eax], 0       flag = ([lck] <= 0)
           jle spin           if (flag) goto spin
           jmp acquire        goto acquire
enter:                        the critical section starts here

To acquire the lock, we first atomically decrement lck. If the lock was free, the decremented value should be non-negative (actually, zero). If it isn’t, we start spinning, waiting for lck to become positive, and then try to acquire it again.

To release the lock, we set lck back to 1:

release:   mov [eax], 1       [lck] = 1

Notice that the load part of the instruction cmp [eax], 0 is in a data race with the store mov [eax], 1. This is the race that is detected by Jinx.

But does this race break sequential consistency? In order to break SC, the race would have to be triangular, which means that the racing load would have to be preceded by some other store that’s not flushed. But the last store before entering the spin loop is part of the LOCK’d instruction, lock dec [eax], which flushes the write buffer. Since there is no triangular race, the spinlock is indeed sequentially consistent.

Conclusion

It’s eye-opening to realize how much wiggle room there is in the definition of a data race. We have the informal definition: two simultaneous accesses to the same memory location, at least one of them being a write. Except that now we have to define “simultaneous,” which is a bit tricky. Simultaneity is characterized by a lack of ordering: two events are simultaneous if it’s impossible to tell which came before the other.

Events on different processors may be ordered through synchronizing actions. At the lowest level, these are LOCK’d instructions and fences. However, there is no simple mapping between high-level-language definitions of data races and what is observed at the processor level during execution.

Ultimately, the reason for defining data races is to be able to identify one type of concurrency bug. But bug detection is in general undecidable, so the best we can do is to define behaviors that are most likely related to programmers’ errors. Most programmers reason about concurrent programs using the sequentially consistent model of execution. This is why the most insidious concurrency bugs are related to the breakdown of sequential consistency. There is a linkage between data races and (the lack of) sequential consistency, which is expressed in terms of the DRF guarantee: A program with no data races behaves in a sequentially consistent way.

We can reverse the reasoning and define data races in a way that would guarantee the DRF property, and this is what is usually done. Seen from that perspective, it’s possible to strengthen the DRF guarantee on TSO processors. The new TRF (Triangular Race Freedom) guarantee states that a program with no triangular data races behaves in a sequentially consistent way and vice versa.

We may reasonably assume that non-triangular data races are not bugs but rather were deliberately coded, either directly by programmers (as was the case with the Linux spinlock), or by a compiler translating a sequentially consistent high-level program. The jury is still out, though, because outside of lock-free programming any race, triangular or not, is still very likely to be a bug.

Acknowledgments

I’d like to thank Anthony Williams, Dmitriy V’yukov, and my co-workers for valuable feedback.

Bibliography

  1. Scott Owens, Susmit Sarkar, Peter Sewell, A better x86 memory model: x86-TSO
  2. Scott Owens, Reasoning about the Implementation of Concurrency Abstractions on x86-TSO