
If you expected std::async to be just syntactic sugar over thread creation, you can stop reading right now, because that’s what it is. If you expected more, read on.

Don’t get me wrong, std::async combines several useful concurrency concepts into a nice package: It provides a std::future for the return value, and hides the std::promise side of the future. It also provides options to run a task synchronously. (See the Appendix for a short refresher.)

But tasks have a slightly different connotation in parallel programming: they are the basic blocks of task-based parallelism. And C++11 tasks fall short on that account.

Task-Based Parallelism

Tasks are an answer to performance and scalability problems associated with threads. Operating system threads are rather heavy-weight; it takes time and system resources to create a thread. If you have an algorithm that naturally decomposes into a large number of independent computations, a.k.a. tasks, you’ll probably get your best performance not by creating a separate thread for each task, but by adjusting the number of threads depending on the amount of parallelism available on your particular system, e.g., the number of cores and their current loads. This can be done by hand, using thread pools and load balancing algorithms; or by using task-based systems.

In a task-based system, the programmer specifies what can be run in parallel but lets the system decide how much parallelism to actually use. The programmer splits the algorithm into tasks and the runtime assigns them to threads — often many tasks to a thread.

There are many implementations of task-based parallelism with varying language support. There’s the Cilk language which pioneered this approach; there’s built-in support in Haskell, F#, and Scala; and there are several C++ libraries, like Microsoft PPL or Intel TBB.

Unlike thread creation, task creation is supposed to be relatively inexpensive, letting the programmer exploit fine-grained parallelism and take advantage of multicore speedups.
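To make this concrete, here’s a small sketch of my own (not taken from any particular task library; the function name and the 1024-element cutoff are arbitrary): a divide-and-conquer sum that spawns a task for one half of each range. In a true task-based system these spawns would be cheap; with std::async each one may cost a whole OS thread.

#include <future>
#include <numeric>

// Sketch of fine-grained task decomposition: recursively split the range,
// spawn a task for one half, and keep working on the other half.
long long par_sum(const long long * beg, const long long * end)
{
    if (end - beg < 1024)                        // small range: compute inline
        return std::accumulate(beg, end, 0LL);
    const long long * mid = beg + (end - beg) / 2;
    auto left = std::async(par_sum, beg, mid);   // potentially a new task/thread
    long long right = par_sum(mid, end);         // continue on this thread
    return left.get() + right;                   // join the child task
}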

At the center of task-based systems are work-stealing queues. When a thread creates tasks, they are initially queued on the processor (core) that runs the thread. But if there are other idle processors, they will try to steal tasks from other queues. The stolen tasks are then run in parallel.

Notice that tasks must be able to migrate between threads. What’s more, efficient use of OS threads requires that tasks that are blocked, for instance waiting for I/O or waiting for other tasks to finish, should be taken off their threads, so that those threads can be reused by other tasks.

C++11 Tasks

My expectation was that C++11 “tasks” that are created using std::async should be abstracted from threads, just as they are in task-based parallelism. When I started preparing a video tutorial about tasks in C++, I wrote a simple program to demonstrate it. I created async tasks using the default launch policy and waited for them to complete. Each task slept for one second and then printed its thread ID.

#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() 
{
    std::cout << "Main thread id: " << std::this_thread::get_id() 
        << std::endl;
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 20; ++i)
    {
        // Default launch policy: the implementation decides whether to run
        // the task asynchronously or defer it until the future is waited on.
        auto fut = std::async([]
        {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            std::cout << std::this_thread::get_id() << " ";
        });
        futures.push_back(std::move(fut));
    }
    // Wait for every task to finish.
    std::for_each(futures.begin(), futures.end(), [](std::future<void> & fut)
    {
        fut.wait();
    });
    std::cout << std::endl;
}

The results were surprising. The first six tasks executed in parallel, each in its own thread. But the rest of the tasks executed in the main thread one after another, separated by 1 second intervals. (Note: this behavior was fixed in v 1.7 of Just::Thread — read on).

The output of the task test

This approach to parallelism obviously doesn’t scale very well.

Then I wrote another program that lists directories recursively, creating a separate async task for each subdirectory. But this time I explicitly requested launch policy launch::async, which guarantees that each task will start in a new thread. This program worked up to a point, but when I tried to list my whole disk, it failed by exhausting Windows’ limited thread creation capability. Again, this approach doesn’t scale very well.

What was even worse, when the program didn’t fail, it performed better with launch::deferred policy, which forces all tasks to be executed serially, than with the launch::async policy. That’s because thread creation in Windows is so expensive that it can easily nullify performance gains of parallelism (although Windows 7 supports user-level threads, which might bypass these problems).
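For reference, here’s how the two policies are requested explicitly (a toy snippet of my own, not the directory-listing program):

#include <future>

int main()
{
    // launch::async guarantees the task gets its own thread.
    auto eager = std::async(std::launch::async, []{ return 42; });
    // launch::deferred runs the task lazily, in the thread that calls get/wait.
    auto lazy = std::async(std::launch::deferred, []{ return 42; });
    int a = eager.get(); // computed on the extra thread
    int b = lazy.get();  // computed right here, synchronously
    return a + b - 84;   // 0
}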

My first reaction was to blame Anthony Williams for badly implementing the Just::Thread library I was using. When he assured me that it was Standard compliant, I turned to Herb Sutter and Hans Boehm for confirmation and they sided with Anthony. It turns out that there are serious problems that prevented C++11 from standardizing task-based concurrency.

The Problems

The foundation of task-based parallelism is the ability for tasks to share threads and to migrate between threads. This sharing and migration must be transparent.

The requirements for the default-launch tasks are the following:

  • The runtime can run such a task either asynchronously or synchronously
  • When it’s run synchronously, it should be run in the context of the parent thread
  • When it’s run asynchronously, it should behave as if it were run on a separate thread

Strictly speaking, a task could always call this_thread::get_id() and fool any attempt at thread sharing or migration by discovering the ID of the current thread. In general, the namespace std::this_thread, which also contains sleep functions, is thread-bound.
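Here’s a small illustration of that thread-binding (my own snippet): a default-launch task can compare thread IDs with its parent and detect whether it was deferred or run asynchronously.

#include <future>
#include <iostream>
#include <thread>

int main()
{
    auto parent = std::this_thread::get_id();
    auto fut = std::async([parent]
    {
        // Same ID as the parent means the task ran deferred (synchronously).
        return std::this_thread::get_id() == parent;
    });
    std::cout << (fut.get() ? "ran in the parent thread\n"
                            : "ran in its own thread\n");
}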

But let’s suppose that we only require that asynchronous tasks behave as if they were run on separate threads, except when they call functions in the this_thread namespace. There are still several problems.

Thread-Local Variables

C++11 introduced a new thread_local storage qualifier. A thread-local variable is separately initialized and destroyed in every thread. It must not survive thread termination. This requirement complicates thread sharing.
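A short example of why this matters (my own sketch; the names are arbitrary): if two tasks ever shared one OS thread without cleanup, the second task could observe thread-local state left behind by the first, which the thread_local semantics for launch::async tasks don’t allow.

#include <future>
#include <iostream>

thread_local int tl_counter = 0;

int bump() { return ++tl_counter; } // 1 on a fresh thread, more on a reused one

int main()
{
    auto a = std::async(std::launch::async, bump);
    auto b = std::async(std::launch::async, bump);
    // Prints "1 1" as long as each task gets fresh thread-local storage.
    std::cout << a.get() << " " << b.get() << std::endl;
}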

In our exchange, Hans Boehm clarified the termination requirement for tasks: Thread-local variables must be fully destroyed before the owning thread returns from calling get or wait on a future produced by the corresponding std::async; or before the destructor of that future returns, whichever comes first.

This actually leaves some wiggle room: A thread could be reused if the runtime guarantees that thread-local variables of terminating tasks are correctly destroyed. Unfortunately, this might be impossible if the programmer calls an OS-specific API, like Windows’ TlsAlloc. Anthony also pointed out that it’s not clear how to deal with DLL_THREAD_DETACH handlers in DLLs, when switching to task granularity.

Locks

There’s another aspect of C++11 concurrency that is tied to threads — locking. The std::mutex object is thread aware. It requires that unlock is called from the same thread as lock. Why should this be a problem?

I haven’t mentioned yet that task migration might be necessary in the middle of execution, but that is what most task-based systems do. It’s an optimization in the case when you’re dealing with blocking tasks.

There are two major blocking scenarios: external and internal. External blocking happens when a task calls an OS function (directly or indirectly) that may block, for instance waiting for I/O. My directory listing program did a lot of that. Internal blocking, which is potentially easier to intercept, happens when tasks are blocked on futures. My program did a lot of that too, when waiting for the results of tasks that were listing subdirectories of the current directory.

A blocked task doesn’t use processor resources, but it does occupy a thread. That thread could be reused to run another task. But that requires a clean way of taking a task off a thread and then restoring its state once the call unblocks. Now, if the task takes a lock on a mutex before blocking, it cannot be migrated to another thread. The unlocking wouldn’t work from another thread.
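Here’s the thread affinity of std::mutex in its starkest form (a deliberately broken sketch of my own): unlocking from a thread other than the one that locked it is undefined behavior, which is exactly what a transparently migrated task would end up doing.

#include <mutex>
#include <thread>

std::mutex m;

int main()
{
    m.lock();                          // locked on the main thread
    std::thread t([]{ m.unlock(); });  // undefined behavior: a different thread unlocks
    t.join();
}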

Herb Sutter observed that, if we tried to restore the task to its original thread, we might get into a spurious deadlock, when the original thread is occupied by another task waiting for the same mutex.

The other problem with locks is in the use of a recursive_mutex. A thread may lock such a mutex multiple times before calling unlock (also multiple times). But if a second thread tries to lock a mutex that’s owned by the current thread, it will block. As long as tasks run in separate threads, this works. But if they share the same thread, they may successfully acquire the same mutex and cause data corruption.
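To see why, recall how recursive_mutex ownership works (a minimal sketch of my own): ownership is per thread, so any code running on the owning thread, including a different task, can re-enter the critical section.

#include <mutex>

std::recursive_mutex rm;

void touch_shared_data()
{
    std::lock_guard<std::recursive_mutex> outer(rm);
    // Re-locking from the owning thread always succeeds...
    std::lock_guard<std::recursive_mutex> inner(rm);
    // ...and so would a lock attempt by a different task sharing this thread.
}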

Imagine the following scenario. Task A wants to modify a shared data structure and takes a lock on its recursive mutex. It then blocks on some OS call (probably not a good idea in general, but it may happen). The task gets taken off of the current thread, and task B starts executing. It takes a lock on the same mutex — successfully, as it is executing in the same thread, and reads or modifies a data structure that was in the middle of being modified by task A. A disaster unfolds.

Notice that this is not a problem if tasks are run serially in the same thread, as it happens with the launch::deferred policy, because each task runs to completion before allowing another task to run.

Finally, such migration of running tasks would also wreak havoc with thread-local variables.

Possible Solutions

Optimizing the Default Launch Policy

The easiest part was to change the implementation of the default policy, to defer the decision whether to run a given task asynchronously or as deferred. Anthony was quick to notice this, and almost immediately released a fix — version 1.7 of Just::Thread.

The idea is simple: you schedule N tasks asynchronously — N being some runtime number dependent on the number of available cores — and put the rest on a queue. When any of the queued tasks is forced (by the call to get or wait on its future), it is executed synchronously in the context of the forcing thread — as if the launch::deferred policy were used. Otherwise, as soon as one of the asynchronous tasks finishes, the next task from the queue is scheduled to run asynchronously. Here’s the output of the same test program after the changes in Just::Thread:

The output of the test with the new library


This time each task ran in a separate thread, but because of the default launch policy, they ran in small batches that could effectively exploit the parallelism of a multicore machine. Still, without thread reuse, the runtime had to create 22 OS threads. The hope is that the operating system caches thread resources so that the creation of the second batch of threads is substantially cheaper than the first one.

(I suggest running this test when evaluating any implementation of a task library.)
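This is not how Just::Thread implements its scheduler; it’s just a sketch of the batching idea under simplifying assumptions (the function name launch_capped is mine): launch roughly one asynchronous task per core up front and make the rest deferred, so that a forced queued task runs synchronously in the forcing thread. What the sketch does not capture is promoting queued tasks to asynchronous execution as the running ones finish.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <thread>
#include <vector>

// Sketch only: cap the number of asynchronous launches at the core count and
// fall back to launch::deferred for the rest.
std::vector<std::future<void>>
launch_capped(std::vector<std::function<void()>> const & tasks)
{
    unsigned const cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::future<void>> futs;
    for (std::size_t i = 0; i < tasks.size(); ++i)
    {
        auto policy = (i < cores) ? std::launch::async : std::launch::deferred;
        futs.push_back(std::async(policy, tasks[i]));
    }
    return futs;
}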

Thread Reuse

The next step in improving task performance would be to use a thread pool and reuse threads instead of creating them from scratch. Because of the problem with thread-local variables, it might be impossible to implement thread reuse without some help from the language runtime. The task library would need hooks into every creation of a thread_local variable, so it can destroy them at task exit.

That still leaves the problem of tasks calling APIs like TlsAlloc directly. An attractive option (for library writers) would be to ignore the problem — after all, the language provides a portable way of dealing with thread-local storage.

Task Migration

We would like to be able to remove a blocked task from a thread in order to run another task on it. This is not easy because of thread-locals and locks.

The problem with thread_local variables is that they should really be task-local. Or at least they should behave “as if” they were task-local. So when two tasks are sharing the same thread, there has to be some mechanism for “context switching” between them. The context would have to include the state of all thread-local variables.

Migrating a task that is holding a lock could only be done if locks were task-bound rather than thread-bound. Interestingly, there is a provision in the Standard for this kind of behavior. The definition of Lockable in (30.2.5) talks about “execution agents” that could be threads, but could also be something else. This comment is of particular interest:

[ Note: Implementations or users may introduce other kinds of agents such as processes or thread-pool tasks. —end note ]

However, the Standard Library mutex is bound to threads, not tasks. The intention of (30.2.5) is that, if you create your own separate task library with your own task-local variables and mutexes, you will still be able to use the standard utilities such as lock_guard or condition variables. But the implementation of std::async tasks must work with thread_local and std::mutex.
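To illustrate what (30.2.5) permits, here’s a hypothetical TaskMutex (my own sketch; it just forwards to std::mutex here, but a real task-aware version would record the owning task instead of the owning thread). Because it satisfies the Lockable requirements, standard utilities such as std::lock_guard work with it unchanged.

#include <mutex>

class TaskMutex { // hypothetical: a real one would track task, not thread, ownership
    std::mutex m_;
public:
    void lock()     { m_.lock(); }
    bool try_lock() { return m_.try_lock(); }
    void unlock()   { m_.unlock(); }
};

void critical_section(TaskMutex & m)
{
    std::lock_guard<TaskMutex> guard(m); // lock_guard works with any Lockable
    // ... touch data shared between tasks ...
}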

Deadlocks

Here’s a potential scenario where two tasks could deadlock if their threads are reused while they are blocked:

  1. Task A runs on thread T1, takes the mutex M1, and makes a blocking call
  2. The runtime takes A off T1 (saves its state, etc.) and puts it in a Blocked queue
  3. Task B starts executing on the same thread, T1, and tries to take M1, which is locked by A
  4. In order to unlock M1, task A would have to run on T1 — the same thread the lock was taken on — but T1 is now occupied by B, and A can’t make progress

The only way to resolve this deadlock is to take B off the thread. So that’s what a task migration system must do — guarantee that any blocked task is taken off its thread.

In general, any spurious deadlock would involve a bunch of blocked tasks. If all of them are blocked on locks, this is an actual deadlock which would happen anyway. If there is at least one task that can make progress when its blocking call returns, it can always be assigned back to its thread, either because the task running on it completes, or because it’s blocked and will be taken off of it.

Of course if we allow lock migration, as in associating locks with tasks rather than threads, the problem disappears on its own.

Conclusion

What I learned from this exercise was that std::async with default launch policy can be made usable. However, its strong association with threads makes it virtually impossible to implement full-blown task-based parallelism. A task-based system could be implemented as a library but it would have to come with severe restrictions on the use of thread_local variables and standard mutexes. Such a system would have to implement its own version of task-local variables and mutexes.

I’m grateful to Anthony Williams, Hans Boehm, Herb Sutter, and Artur Laksberg for enlightening discussions.

Appendix: async Refresher

Here’s some typical code that uses std::async to start a task:

auto ftr = std::async([](bool flag)->bool
{
    if (flag)
        throw std::runtime_error("Hi!");
    return flag;
}, true); // <- pass true to lambda
// do some work in parallel...
try
{
    bool flag = ftr.get(); // may re-throw exception
}
catch(std::exception & e)
{
    std::cout << e.what() << std::endl;
}

The code calls std::async with a lambda (anonymous function) that takes a Boolean flag and returns a Boolean. The lambda can either throw an exception or return the flag back. The second argument to async (true, in this case) is passed to the lambda when it is executed.

The value or the exception is passed back to the parent code when it calls the get method on the future returned by async. The call to async may create a new thread, or defer the execution of the function until the call to get is made.

The same code may be implemented directly using std::thread, std::promise, and std::future but, among other things, it requires modifications to the thread function (here, to the lambda):

std::promise<bool> prms;
auto th = std::thread([](std::promise<bool> & prms, bool flag)
{
   if (flag)
     prms.set_exception(std::make_exception_ptr(std::runtime_error("Hi!")));
   else
     prms.set_value(flag);
}, std::ref(prms), true);
// do some work
th.join();
auto ftr = prms.get_future();
try
{
   bool flag = ftr.get();
}
catch(std::exception & e)
{
   std::cout << e.what() << std::endl;
}

By the way, I haven’t been slacking off. The C++11 Concurrency Series of video tutorials is now at the fifth installment (there are links to the previous tutorials as well). I’ve covered threads, fork/join, futures, promises, async tasks, task-based concurrency and much more. Next time I’ll try to gently introduce elements of map/reduce. I’m surprised how much can be done without locks and traditional synchronization (although those will come later in the series).

I’m starting a series of hands on tutorials on concurrent programming using C++11. This first installment jumps right into “Hello Thread!” and the discussion of fork/join parallelism. I also sneaked in some lambdas and even one closure. Enjoy!

Concurrency bugs are hard to spot and hard to debug. They have a tendency to slip through the cracks of traditional testing. In this blog post I’ll go through a simple example of a buggy program and explain why the bug is so hard to reproduce and how a dynamic debugger like Jinx can accelerate this process by orders of magnitude.

Toy Example

How do you show somebody that their program has a concurrency bug? You come up with an interleaving of threads that demonstrates the bug: “What if thread A is pre-empted between this instruction and the next one, and thread B jumps in and executes that instruction?”

Here’s a trivially incorrect example of a concurrent class:

#include <mutex>

const double PI = 3.14159265358979323846;

class Circle {
   std::mutex mutex;
   double r; // radius
   double c; // circumference
public:
   Circle(double radius) : r(radius), c(2 * PI * r) {}
   void SetRadius(double radius) {
      mutex.lock();
      r = radius;       // c is now stale until UpdateCircumference is called
      mutex.unlock();
   }
   void UpdateCircumference() {
      mutex.lock();
      c = 2 * PI * r;
      mutex.unlock();
   }
   double GetRadius() {
      mutex.lock();
      double radius = r;
      mutex.unlock();
      return radius;
   }
   double GetArea() {
      mutex.lock();
      double area = 0.5 * c * r; // equals PI * r * r only if c is up to date
      mutex.unlock();
      return area;
   }
};

The problem arises if thread A is doing this:

circle.SetRadius(r1);
circle.UpdateCircumference();

while thread B is doing this:

area = circle.GetArea();

The question you might ask the unfortunate programmer during a code review (you have those, right?) is: “What if thread A is pre-empted between SetRadius and UpdateCircumference and thread B jumps in and executes GetArea?”. That’s the interleaving that exposes the atomicity violation bug in this code.
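For contrast, here’s one way to close that window (a sketch, not part of the original example; the method name is mine): a method added to the Circle class above that updates the radius and the circumference inside a single critical section, so no other thread can observe the intermediate state.

void SetRadiusAndCircumference(double radius) // hypothetical addition to Circle
{
    mutex.lock();
    r = radius;
    c = 2 * PI * r; // both fields change under one lock: no stale circumference
    mutex.unlock();
}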

Code reviews don’t catch all bugs, so how about some defensive programming? Suppose the programmer adds an assertion, just in case:

area = circle.GetArea();
assert(area == PI * circle.GetRadius() * circle.GetRadius());

Will this help in uncovering the bug? (By the way, did you notice that this assertion has a concurrency bug of its own?)

Back-of-the-Envelope Calculation

Let’s assume that, during normal testing on a multicore machine, the non-atomic update in thread A executes once a second, and so does the GetArea call in thread B.

We’ll hit the bug if thread B executes its mutex.lock() in GetArea after thread A takes the lock in SetRadius and before it re-takes the lock in UpdateCircumference. That’s the window of vulnerability.

If thread B starts its call within the vulnerability window, it will read the new value of r and the old value of c. Notice that while thread A holds the lock, thread B is spinning, waiting for thread A to release the lock.

Most of the time in this window is spent acquiring and releasing the lock — which on average should take about 1000 processor cycles. On a 1GHz core that’s about 10^-6 s.

In our example, we would have to run the test continuously for about 10^6 s in order to hit the bug. That’s about 10 days of testing. Of course your mileage may vary, but you see what the problem is. Concurrency bugs are very hard to reproduce.

Crash Early and Crash Often

A tool that could accelerate the detection of concurrency bugs would save the developers and testers a lot of time and effort. That’s what Corensic’s Jinx does. The basic idea is simple: Run multiple simulations of the execution of code fragments in parallel, carefully picking the candidate interleavings to increase the chances of hitting the vulnerability window.

The enabling technology in this case is virtualization. Jinx is able to periodically virtualize the machine for short periods of time in order to run its simulations.


During a simulation round, Jinx runs several exploratory executions in virtual space, picks the most promising one, and runs it in real space. In this case, Jinx found an execution that leads to a crash.

Here’s what Jinx does when it takes control of the machine:

  1. It stops the running program and takes a snapshot of its state.
  2. It runs about a millisecond’s worth of execution of the program in virtual space and records all memory accesses.
  3. It explores several alternative executions of the same fragment in virtual space. It ranks them and picks (“retires” in our jargon) the one that either crashes the program or is most likely to manifest a concurrency bug.

In the picture above, Jinx took a snapshot at point A and ran several simulations, one of them resulting in a (virtual) crash at point B. It then retired this particular simulation and ran it in real space, crashing the program at point C.

Of course, the most important part of the algorithm is the choice of executions for simulations. Random rescheduling of the threads is as unlikely to catch the bug as is regular testing. Exhaustive exploration of all possible interleavings would take astronomical time.

Smart Exploration

Imagine that you film the activities of all processors while executing a multithreaded program. Each frame corresponds to a single processor instruction. You’d end up with N tapes from N cores. Your task is to splice those tapes into one continuous movie. You can cut the tapes at arbitrary locations and interleave the fragments. Because of combinatorial explosion, the number of possible movies is overwhelming. That’s how many possible executions there are in a concurrent program. But Jinx has some additional information that cuts down this combinatorial space to a much more reasonable size.

As I mentioned before, during the first exploratory execution of a program fragment Jinx records all memory accesses. Of particular interest to us are accesses to shared memory — we recognize those when a thread touches a memory location that’s already been touched by another thread. Those shared memory accesses are our cue points — that’s where we cut the tape.

The idea is that the fragments between shared-memory accesses have no dependency on what other threads are doing, so any interleaving between them will have no impact on program behavior. What matters is the order in which the shared memory accesses happen, and their re-arrangements may produce distinct observable behaviors.

Since in a typical concurrent program there aren’t that many shared variables, the number of possible rearrangements is quite manageable.

Let’s go back to our example and study it in more detail. The simplest implementation of a mutex is based on one shared memory location, let’s call it l, and access to it goes through some version of the exchange instruction (for instance LOCK CMPXCHG), which we will simply call xchg. From the point of view of Jinx, both lock and unlock look like xchg l. There are two more shared variables, circle.r and circle.c.
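A sketch of that simplest mutex in C++ terms (mine, not Jinx’s or Linux’s; the type name is arbitrary): both lock and unlock are a single atomic exchange on the shared location l, which the compiler turns into an (implicitly LOCK’d) XCHG on the x86.

#include <atomic>

struct SpinLock
{
    std::atomic<int> l{0};
    void lock()
    {
        while (l.exchange(1) == 1) {} // spin: xchg l until we see it was free
    }
    void unlock()
    {
        l.exchange(0);                // also an xchg l, releasing the lock
    }
};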

Suppose that, during the initial exploration, threads A and B executed their respective accesses to Circle. As I explained earlier, it’s very unlikely that thread B would hit the vulnerability window, so the execution proceeded without conflict and Jinx recorded all accesses to shared variables.

Initial exploration: Thread B makes its call to GetArea far outside the vulnerability window. Only accesses to shared memory are represented in this diagram.

Let’s concentrate on the shared variable l used by the mutex. Thread A accessed l four times, so there are four distinct locations to which thread B’s first access to l could be moved (this is done by selectively delaying some threads). Jinx might therefore run four possible simulations involving l. Here’s the first one:

Simulation 1: Thread B takes the lock before thread A's first access to l. Notice that thread A has to spin (the xchg instruction is repeated three times). It has to wait for thread B to release the lock. Thread B correctly calculates the old area of the circle.

Notice an interesting fact: The simulated execution is not just a simple splicing of two recorded tapes. Unlike in the original execution, when thread A tries to enter the critical section it is blocked by thread B and has to spin. A new choice of interleaving often leads to a different execution.

The second and third simulations result in the bug manifesting itself.

Simulation 2: Thread B accesses l between thread A's first and second access to l. This time thread B has to spin until thread A releases the lock. After that, thread A has to spin until thread B releases the lock. This simulation hits the bug -- thread B sees the new value of r and the old value of c.

Simulation 3: Thread B accesses l between thread A's second and third access to l. This simulation also hits the bug.

The fourth simulation misses the vulnerability window.

Simulation 4: Thread B accesses l between thread A's third and fourth access to l. Thread B has to spin, but the bug is not hit -- the new area of the circle is correctly calculated.

This is not bad. Suddenly the likelihood of producing buggy behavior shot up substantially.

Let’s do another back-of-the-envelope calculation. Assume that during each simulation round Jinx samples about 1ms of program execution and works on it for roughly 100ms. In order not to slow the program too much, Jinx may perform, say, 5 simulation rounds per second.

The probability that both accesses from our example will fall within the 1ms sampling period is 10^-3. Jinx can perform the 10^3 simulation rounds necessary to reproduce the bug in 200 seconds or slightly more than 3 minutes. Compare this with the 10 days of unassisted testing we got previously.

What Happens Next?

Suppose that, after running for about 3 minutes, Jinx found a simulation that manifested a concurrency bug. The question is: What are the consequences of this bug? If the bug causes an assertion to fire, or results in an access fault within the same 1ms simulation period, Jinx will immediately re-run the same simulation in real time and cause the program to crash, thus proving beyond reasonable doubt that there was indeed a bug in the program.

But even if Jinx cannot trip the program, it will still rate different simulations according to its own heuristics, and it will run the one that scores the highest. In our example, even if the five simulations (including the original run) are rated the same, Jinx will pick one of them with the probability 1/5. Since two of them expose the bug, Jinx should be able to hit one in less than 10 minutes. Although the program won’t crash immediately, it will be running with corrupt data, which will eventually be discovered.

Conclusion

A dynamic concurrency debugger works by rearranging thread interleavings. In general, this method leads to an exponential explosion. Jinx uses two innovative techniques to make this approach feasible: It runs multiple simulations in parallel using virtual machine technology, and it rearranges thread timings based on shared-memory communications. The result is a concurrency bug accelerator that can speed testing up by orders of magnitude while running a program at almost normal speed.

Recently we’ve had yet another discussion at work about the definition of data race. You’d think that being in the business of detecting data races, we should know what a data race is. Unfortunately, it’s not so simple. Defining a data race is one of the harder problems faced by the designers of memory models (and concurrency bug checkers). Java had it figured out some time ago, and now C++11 has its own definition. But our product, Jinx, is language agnostic–it catches data races at the processor level.

The problem is that there is no one-to-one mapping between data races as defined by a higher-level language and the object code produced by the compiler. Consider a lock-free algorithm written in C++ using weak atomics (e.g., using memory_order_relaxed operations on atomic variables). When compiled for the x86 processor, the assembly instructions might look identical to those generated without atomics. But from the point of view of C++, the first version is race-free and the second is full of races. In this case the role of atomics is reduced to being nothing more than programmer’s annotation that says, “I know what I’m doing.” (And also, “This code is portable.”)
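Here’s a tiny sketch of that point (my own example): on the x86, a relaxed store or load of an atomic<int> typically compiles to the same plain MOV as a non-atomic access, yet only the atomic version is race-free C++.

#include <atomic>

std::atomic<int> ax(0);
int plain = 0;

void writer()
{
    ax.store(1, std::memory_order_relaxed); // race-free; usually a plain MOV on x86
    plain = 1;                              // a C++ data race if read concurrently
}

int reader()
{
    int a = ax.load(std::memory_order_relaxed); // race-free load
    int b = plain;                              // racy load, same machine code
    return a + b;
}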

On the other hand, neither is there one-to-one correspondence between data races, as defined by a higher-level language, and concurrency bugs. Sometimes looking at how the code is actually executed on a multicore processor may flag concurrency errors that don’t manifest themselves as data races at the source level. And then there are programs that mix high-level languages and inline assembly, for which high-level analysis is impossible.

The purpose of this blog post is to gain some insight into data races at the processor level. I’ll start with the description of the x86 memory model and a definition of data race. I’ll explain the relationship between data races and sequential consistency. Then I’ll show you an example of a Linux spinlock, which contains a deliberate data race. This race can be classified as a non-triangular data race. I’ll explain what a triangular data race is and why it might be important in analyzing program execution.

The x86 Memory Model

In order to understand data races at the lowest level you have to dig into a processor’s memory model. For the x86, both Intel and AMD produced documents that provide some guidance, but fall short of defining a formal model. It took a lot of research and double guessing before academics and Intel engineers agreed on one particular model called Total Store Order (TSO).

The gist of this model is that, at least conceptually, all writes to memory from a particular core go through this core’s write buffer, which is invisible to other cores. The contents of this buffer find their way to global memory in FIFO order at some random intervals or as a result of special instructions. These special instructions are either LOCK’d instructions (the ones with the LOCK prefix) or fences (actually, only MFENCE is relevant here, because the model covers regular write-back memory, as opposed to write-combining memory or non-temporal instructions). Those instructions flush the buffer of the core that is executing them. It’s pretty surprising that such a simple model explains all the strange behaviors of the x86 processor with respect to memory loads and stores.

Total Store Order processor memory model

A few more details: Several x86 instructions are modeled as a series of actions (see our earlier blog entry). In particular, a LOCK’d instruction can be split into smaller constituents:

  1. LOCK, which flushes the buffer and locks access to global memory for all cores,
  2. loads, stores, and arithmetic actions specific to a particular instruction,
  3. UNLOCK, which flushes the buffer and unlocks access to global memory.

For instance, the execution of:

LOCK XCHG [x], EAX

may be modeled by:

  1. flushing the local write buffer,
  2. acquiring the lock on the bus,
  3. loading x from memory to a temporary register,
  4. writing EAX to x (through the local write buffer),
  5. copying the value from the temporary to EAX,
  6. flushing the write buffer (thus forcing the new value of x to be written to global memory), and
  7. releasing the bus lock.

Keep in mind that this is an idealized model, not necessarily corresponding to actual architecture or microcode.

x86 Data Races

Let’s start with the simplest definition of a data race and build things up from there. Imagine that you are single-stepping through a program that has multiple threads running on multiple cores. At each step you pick one core and execute the next instruction in its thread. For complex (but not LOCK’d) instructions, like INC [x], you might execute just one of its actions, e.g., the load of [x], at a time. This kind of step-by-step execution is called sequentially consistent, as long as all stores go directly to global memory.

A data race occurs if you can pick two actions accessing the same memory location on two separate cores to be executed next to each other, and at least one of them is a store. Notice that if these two actions can be made adjacent in a global execution order, they can also be executed in the reverse order. Now, if both actions were just memory loads, the order wouldn’t matter. This is the reflection of a more general principle that you can never have data races on immutable data. But if one of the accesses is a store, the order does matter.

This definition of data race is the foundation of dynamic race detectors, including Jinx. Jinx virtually (that is, using a virtual machine technology) re-runs parts of your program in a sequentially consistent manner, rearranging the threads in order to, among other things, bring shared memory accesses next to each other. If the program is properly synchronized, this should be impossible. For instance, one of the threads will be blocked waiting for a lock (this time I’m talking about higher-level locks, or critical sections) while the other is accessing a shared variable under the same lock.

Of course, the next question is about the locks themselves: They are implemented using the same instructions as the rest of the code, and use shared variables, so how come they don’t generate their own data races? Some race detectors recognize a few predefined patterns of instructions that correspond to locks in a particular language, library, or operating system. Again, Jinx tries to be agnostic.

Maybe unsurprisingly, higher level locks are implemented using LOCK’d instructions. So if we narrow the definition of data races to exclude conflicts between locked instructions, we should be fine. But are we allowed to play with definitions? How do we decide what is and what isn’t a data race? Are there some general principles to give us guidance? There is one I mentioned before called…

Sequential Consistency

Ideally, we would like to define data races so that they are always concurrency bugs. This isn’t easy, because we don’t have a definition of a bug. In a higher-level language, like C++, it’s enough to say that a data race leads to undefined behavior. But at a processor level, everything is defined. A processor won’t suddenly stop and say, “I don’t know what to do with this program.”

However, higher level languages give us a hint in terms of the DRF guarantee. DRF stands for Data-Race Freedom, and the guarantee is that if your program is free of data races, it is sequentially consistent (SC). Mind you, an SC program might still be buggy, and even have concurrency bugs, but at least we know how to reason about such programs. We run them in our minds, one action at a time, switching between threads at will, assuming that all writes go directly to main memory and are immediately visible to other threads. We know that this is not what really happens, but if a program is DRF, this is how it behaves.

Not all languages offer the DRF guarantee; C++ supports weak atomics that break SC without introducing data races. Nevertheless, DRF is a good goal.

What does it all mean for the x86? The definition of SC stands at the assembly level, although we have to allow for splitting complex instructions. For instance, INC[x] could be split into three actions: a load, an increment, and a store. If this instruction doesn’t have a LOCK prefix, another thread is free to execute its own actions in between the actions of the current instruction.

A TSO processor, like the x86, would be almost SC, if it weren’t for those pesky write buffers. Still, many memory-access conflicts on an x86 don’t break sequential consistency. For instance, because of the FIFO nature of the write buffers, two conflicting stores don’t break SC, and therefore are not considered a data race. Think of it this way: Inverting two adjacent stores from different threads doesn’t change anything because both stores go to separate write buffers. The external visibility of those stores depends on the order in which the buffers are discharged into global memory, and they are discharged in a sequentially consistent way.

So the only races on the x86 are those between loads and stores, except when both are LOCK’d (in which case they flush their buffers). That’s a good starting point, but a practical question remains: If we report all of those events as data races, how many of them will be considered false positives in real life? Unfortunately, the answer is: too many. It turns out that an efficient implementation of higher-order locks often contains deliberate races. It’s true that the acquisition of a lock usually starts with a LOCK’d instruction — often a CMPXCHG, also known as CAS — but the corresponding release might be implemented using an unsynchronized store. This is why the Jinx definition of a data race excludes conflicts in which even one of the instructions is LOCK’d.

Do we miss some legitimate data races because of that? Perhaps, but we haven’t seen such examples in practice yet. Interestingly, even this narrowed definition of data races flags some legitimate lock-free algorithms, in particular the Linux ultra-fast implementation of a spinlock.

Spinlock

Here’s an actual report produced by Jinx on Linux (Jinx can be used to debug the kernel):

Event: Data Race (3 times)
* Stack A
#0 0xffffffff81036d44 in __ticket_spin_lock at spinlock.h:65
#1 0xffffffff8159e7ee in arch_spin_lock at paravirt.h:744
#2 0xffffffff810909cd in futex_wake at futex.c:891
#3 0xffffffff81093288 in do_futex at futex.c:2602
#4 0xffffffff810933fb in sys_futex at futex.c:2636
* Stack B
#0 0xffffffff81036d74 in __ticket_spin_unlock at spinlock.h:101
#1 0xffffffff81090a58 in arch_spin_unlock at paravirt.h:760
#2 0xffffffff81093288 in do_futex at futex.c:2602
#3 0xffffffff810933fb in sys_futex at futex.c:2636

Two threads, A and B, ran into a race in spinlock.h. One access occurred at line 65 and the other at line 101 of that file. The two stack traces show that the caller, in both cases, was a futex, or a “fast mutex”– a Linux implementation of a thin lock.

Let’s have a look at spinlock.h. Line 65 is inside the function __ticket_spin_lock, which contains a bunch of inline assembly, which I’m not going to analyze here (although I will analyze similar code shortly):

 61 static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 62 {
 63     short inc = 0x0100;
 64
 65     asm volatile (
 66         LOCK_PREFIX "xaddw %w0, %1\n"
 67         "1:\t"
 68         "cmpb %h0, %b0\n\t"
 69         "je 2f\n\t"
 70         "rep ; nop\n\t"
 71         "movb %1, %b0\n\t"
 72         /* don't need lfence here, because loads are in-order */
 73         "jmp 1b\n"
 74         "2:"
 75         : "+Q" (inc), "+m" (lock->slock)
 76         :
 77         : "memory", "cc");
 78 }

The thing to notice is that this is the acquisition of a (ticketed) spinlock, the first instruction of which is indeed LOCK’d. Since Jinx doesn’t flag races with LOCK’d instructions, it must be the next instruction that accesses memory, movb, which is involved in the race. It reads a byte-sized value from memory without using a LOCK.

The other racing line, 101, is inside __ticket_spin_unlock:

 99 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
100 {
101     asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
102          : "+m" (lock->slock)
103          :
104          : "memory", "cc");
105 }

This is the release part of the ticketed spinlock, and it contains an increment instruction that accesses the memory (the same lock->slock variable as in the previous function). The macro, UNLOCK_LOCK_PREFIX, expands to nothing on processors other than Pentium Pro (which had an error). So we do have an unprotected store (part of incb) that races with an unprotected load (movb in the previous function). And yet, the code is correct. But to understand that, we need to learn about a more sophisticated notion of…

Triangular Data Races

Let’s go back to the x86 TSO memory model. The cores write into their private FIFO buffers. Data moves from these buffers into main memory. At any point in time only one buffer may discharge a previous write entry into memory, so those writes end up interleaved with each other. What’s important is that all processors see those writes appearing in main memory in the same order — the Total Store Order. (To be precise, the processor doing the write will see it immediately due to intra-processor forwarding, but that doesn’t matter in our case.)

Now imagine that you have an actual execution on the x86 and you want to show that it’s equivalent to an SC execution. You could do it, for instance, by rearranging instructions so that stores are moved up to the point where they were originally discharged into global memory. In order to preserve program order on individual processors, you would also have to move up the loads (and other instructions) that appear between those stores. But if you move a load forward too much, it might see a different store than it did in the original execution, and the new SC execution will not be equivalent to the original one.

A typical situation is when stores and loads cross each other, as in this program (x and y originally zero):

P1               P2
mov [x], 1       mov [y], 1
mov eax, [y]     mov eax, [x]

If stores from both processors still sit in their respective buffers while the loads are performed, both eax registers will end up with zeros. If we try to move the stores forward to the point where the buffered writes were flushed, we’ll also have to push the loads, and they will end up reading ones instead of zeros. This is clearly a non-SC execution and the races it contains are definitely of interest to us.

Rearranging stores to create an equivalent execution with no buffering. Because program order must be preserved, the loads are rearranged too, resulting in different values being read. The SC execution is not equivalent to the original one. (Time flows from left to right.)

It turns out that all violations of SC on the x86 have one thing in common–a triangular data race. This is a triangular pattern (we have two of those in our example):

P1               P2               Comment
mov [x], 1                        Preceding store
                                  No lock or fence
mov eax, [y]     mov [y], 1       The race between a load and a store

More generally, a triangular data race occurs between a load action on one processor and a store action on another if the load is preceded by some other store. In the case above we have a race on y, and what makes it triangular is the odd store to x. The key is that this store may still sit in the write buffer when the racing load is made. If, on the other hand, there is any fence or LOCK/UNLOCK between the odd store and the racing load, we are fine, the race is not triangular.

What’s amazing is that it can be proven that any x86 program that doesn’t contain triangular data races is sequentially consistent. (For a formal definition of a triangular data race and a proof of the theorem, see Owens.) Armed with this knowledge, let’s revisit the spinlock example.

Spinlock and Triangular Races

The Linux ticketed spinlock is an elaboration on a basic spinlock. Here’s the (more readable) Intel assembly code for that simplified spinlock. There is a variable, lck, on which to spin, and its initial value is 1, corresponding to the lock being unclaimed. The address of lck is stored in eax.

Label      Instruction       Comment
acquire:   lock dec [eax]    atomic {
                                 tmp = [lck] - 1
                                 [lck] = tmp
                                 flag = (tmp >= 0)
                                 flush local write buffer }
           jns enter         if (flag) goto enter
spin:      cmp [eax], 0      flag = ([lck] <= 0)
           jle spin          if (flag) goto spin
           jmp acquire       goto acquire
enter:                       the critical section starts here

To acquire the lock, we first atomically decrement lck. If the lock was free, the decremented value should be non-negative (actually, zero). If it isn’t, we start spinning, waiting for lck to become positive, and then try to acquire it again.

To release the lock, we set lck back to 1:

release:   mov [eax], 1      [lck] = 1

Notice that the load part of the instruction cmp [eax], 0 is in a data race with mov [eax], 1. This is the race that is detected by Jinx.

But does this race break sequential consistency? In order to break SC, the race would have to be triangular, which means that the racing load would have to be preceded by some other store that’s not flushed. But the last store before entering the spin loop is part of the LOCK’d instruction, lock dec [eax], which flushes the write buffer. Since there is no triangular race, the spinlock is indeed sequentially consistent.

Conclusion

It’s eye opening to realize how much wiggle room there is in the definition of data race. We have the informal definition: Two simultaneous accesses to the same memory location, one of them being a write. Except that now we have to define “simultaneous,” which is a bit tricky. Simultaneity is characterized by lack of ordering–two events are simultaneous if it’s impossible to tell which came before the other.

Events on different processors may be ordered through synchronizing actions. At the lowest level, these are LOCK’d instructions and fences. However, there is no simple mapping between high-level-language definitions of data races and what is observed at the processor level during execution.

Ultimately, the reason for defining data races is to be able to identify one type of concurrency bugs. But bug detection is in general undecidable, so the best we can do is to define behaviors that are most likely related to programmers’ errors. Most programmers reason about concurrent programs using the sequentially consistent model of execution. This is why the most insidious concurrency bugs are related to the breakdown of sequential consistency. There is a linkage between data races and (the lack of) sequential consistency, which is expressed in terms of the DRF guarantee: A program with no data races behaves in a sequentially consistent way.

We can reverse the reasoning and define data races in a way that would guarantee the DRF property, and this is what is usually done. Seen from that perspective, it’s possible to strengthen the DRF guarantee on TSO processors. The new TRF (Triangular Race Freedom) guarantee states that a program with no triangular data races behaves in a sequentially consistent way and vice versa.

We may reasonably assume that non-triangular data races are not bugs but rather were deliberately coded, either directly by programmers (as was the case with the Linux spinlock), or by a compiler translating a sequentially consistent high-level program. The jury is still out, though, because outside of lock-free programming any race, triangular or not, is still very likely to be a bug.

Acknowledgments

I’d like to thank Anthony Williams, Dmitriy V’yukov, and my co-workers for valuable feedback.

Bibliography

  1. Scott Owens, Susmit Sarkar, Peter Sewell, A better x86 memory model: x86-TSO
  2. Scott Owens, Reasoning about the Implementation of Concurrency Abstractions on x86-TSO

An acquaintance recently observed that his company was concerned more with correctness bugs than with crashing bugs. For them, correctness bugs are more painful to debug than crashing bugs and they also get worse when you make the leap to shared-memory parallel programming (of which threaded programming is one instance). This is a pretty typical story, and it’s worth considering why correctness bugs are so problematic. Of particular interest is why shared-memory parallel programming makes them so much worse.

Corruption is worse than crashing

If a program stops and complains, that’s one thing: an automated system or a human can just start it again. There might be real-time or reliability goals for the program that aren’t being met, but at least you know something went wrong. If it keeps on going, but corrupts the output in some way, there’s a much bigger problem. Most software is based on an implicit assumption of the fail-stop model. This just means that errors in the code stop the program’s forward progress before erroneous data can become nonvolatile: written to disk or communicated to another process. You can see examples of applications that make this assumption all the time: ones that have trashed their registry entries or made their tables inconsistent and consequently can’t run any more. At least one study finds that 7% of bugs aren’t fail-stop bugs; that is, they produce bad outputs rather than crashing or stopping.

Correctness bugs are hard to fix, hard to detect, and costly

In my acquaintance’s domain, correctness bugs result in subtly wrong graphics. There, the fault tends to get detected, but the error-resolution process is lengthy, as much more of the codebase is suspect. These bugs are detected later in the development process, because earlier stages lack sufficient verification to detect corruption, but crashes are always obvious. In other domains, correctness bugs could mean incorrect chances of success when drilling a test well, an incorrect diagnosis, a wrongly sequenced chromosome, a corrupted database, poor search results, or an ill-advised financial trade. In some of the worst cases, humans or machines take the data out of computation and start committing real-world resources to the results of the computation. Things get really expensive when errors in computation get into nonvolatile storage inside people’s brains.

Some of these bugs are eventually detected, as when you compute an incorrect launch window for a rocket. Others, like choosing not to drill a test well in a fruitful spot, will likely never be detected. The hidden and obvious economic drags of correctness bugs include unhappy customers, missed opportunities, and wasted efforts.

Correctness bugs are more horrible as your software gets more non-deterministic

These bugs get more insidious when you make the leap to shared-memory parallel programming. Whereas in the single-threaded, synchronous world, a given input to a program always yielded the same output, you’re now faced with programs that may give a correct output only a certain percentage of the time. You expect 100%, but it’s not generally possible to prove that a shared-memory parallel program produces deterministic results. This causes trouble for crashing bugs because it makes repros rarer, and allows bugs to creep further through the development process before detection. It also renders debuggers less useful… You can’t just step through until you hit the bug, because sometimes the bug happens, and sometimes it doesn’t, and often the debugger stops it from happening.

The problem is much worse in the case of correctness bugs: both correctness and non-deterministic bugs tend to get detected late in development, and combined, they appear very late. Users of your code can act upon the results because they’re not crashing bugs, and the bug can slip through your test because it’s non-deterministic, even if you have a test case that could have caught the bug! Input coverage isn’t enough any more. How do we deal with this problem?

Dealing with the Problem

Rough classification of programming bugs. Adding assertions to your program can turn correctness bugs into crashing bugs. Limiting shared mutable state reduces non-determinism.

The forces of non-determinism and corruption!

There are three big parts to the answer to this problem. They don’t make the problem go away. They just stop your life from becoming a hell of customer escalations and expensive round-trips to QA.

Choose a simple parallel programming model

First, the way you use multiple cores will change the amount of non-determinism in your software. All the parts of your program that can avoid shared-mutable memory should probably avoid it. For most applications, you’ll never eliminate non-determinism altogether. After all, what is input if not an access of shared-mutable state? But, you can try to guarantee that like inputs yield like outputs by being careful about the sorts of shared-memory parallel programming you do.

Turn your correctness bugs into crashing bugs

Next, increase the likelihood of your bugs being crashing bugs, rather than correctness bugs. The trusty assert() is your friend here. Ship your software with your asserts enabled, and don’t tolerate failing asserts. They tell you where the bug was, and what it was. When you get that bug report from the field, you know exactly what you need to fix. Programs that need to continue to function in the face of failure can use software fault-tolerance approaches like running every computation twice and comparing results, with an optional tie-breaker. Some problems are intrinsically easy to verify, like factoring an integer. Most are not. During test you can use late-stage output verification to achieve fail-stop. Make sure everything does this. By doing this, you’ll change more bugs into being crashing bugs, rather than correctness bugs.
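Here’s a toy sketch of the “run every computation twice and compare” idea (my own illustration; the helper name checked is arbitrary): recompute, assert agreement, and fail-stop before corrupt data can escape. You’d wrap a suspect computation as checked([&]{ return compute(input); }).

#include <cassert>

// Sketch: evaluate a computation twice; a non-deterministic corruption is
// likely to make the two runs disagree, turning a correctness bug into a crash.
template <typename F>
auto checked(F f) -> decltype(f())
{
    auto first = f();
    auto second = f();
    assert(first == second); // fail-stop on disagreement
    return first;
}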

Amplify your testing

Lastly, amplify your non-deterministic bug-find rate. If you expect to deploy software that will run at a rate of tens of thousands of machine-hours per hour, that is, you have tens of thousands of users, or a few users using thousands of computers, you won’t match their ability to find non-deterministic bugs in software without amplifying your bug-find rate. This is a potential success disaster: shipping lots of software will make you realize how buggy it is.

Our goal with Jinx is to let the developer use one computer to test as thoroughly as thousands would, and to let QA engineers make one test pass count for a thousand. To learn more about what we’re doing with Jinx, check out our website!

An interesting preview of things to come in parallel programming: Today was day one of the AMD Fusion Summit in Bellevue. The future, at least according to AMD (and also to Jem Davies from ARM, who was invited for one of the keynote speeches–a partnership, perhaps, between the two big players?), is in heterogeneous chips.

Interestingly, the main argument for heterogeneous computing is not performance, but energy efficiency. And it’s not only about draining battery power in mobile computers. It’s the heat barrier. The more transistors you put on a square inch of silicon, the more heat they produce. Heat dissipation is now the hardest problem facing new generations of computers, from tiny mobiles all the way to server farms and supercomputers.

General purpose CPU’s are great at running some types of programs; GPUs (or their successors, GPGPUs–General Purpose Graphics Processors) are more efficient at others. Being able to move highly parallelizable and vectorizable computations off the CPU and onto a GPU results in more efficient use of energy. So the new series of AMD Fusion processors will combine the x86 cores with high throughput GPU cores.

Those are practical considerations, but there is also a vision of the future, which is hard to ignore. We are all used to multiprocessors by now. Even if your computer or a cellphone doesn’t run highly parallelizable scientific applications, the operating system can easily keep two or four cores busy. It might even be able to take advantage of eight cores. But what about a hundred cores? If you can parallelize your program to such an extent that it saturates a hundred cores, you will most likely be using very specialized algorithms. It’s rather unlikely that you will be creating a hundred tasks each running a different piece of code. Most of them will run the same code on different data sets–the very definition of SIMD (Single Instruction Multiple Data). That’s where GPUs rule. Since you rarely have programs that are entirely SIMD, or entirely MIMD (Multiple Instructions Multiple Data), you want to be able to quickly switch between those modes. That requires some kind of tight integration–hence hybrid multiprocessors.

However, in current hybrids, the offloading of a task from the CPU to a GPU often involves copying large amounts of data and code. So AMD decided to make another bet: shared memory. (I can’t help but to point to one of my earlier blog posts about the future of concurrent programming, in which I made a similar bet). The idea is that all cores, the CPUs and the GPUs will share the same address space. Of course they will have their own caches, both private and shared.

So what about coherency, you might ask? Say hello to weak consistency memory models, complete with atomic instructions (including floating-point compare and swap).
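For instance (a sketch of my own, not AMD’s API), C++11 already lets you build an atomic floating-point update out of compare-and-swap:

#include <atomic>

void atomic_add(std::atomic<double> & target, double delta)
{
    double expected = target.load();
    // compare_exchange_weak refreshes 'expected' with the current value on
    // failure, so the loop retries until the swap succeeds.
    while (!target.compare_exchange_weak(expected, expected + delta))
    {
    }
}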

To be totally frank, this is the worst nightmare from the programmers’ point of view. And I don’t blame AMD for that. There just doesn’t seem to be a good programming model for those monsters. AMD adopted the open standard for programming heterogeneous processors called OpenCL, but it’s a pretty low-level framework. I’m afraid the life of a programmer might get much harder before it gets any better. For now, it’s back to the drawing board.
