I’ve been working on a new malloc implementation for Jinx recently. The last time a developer at Corensic wrote a version of malloc(), I asked him why he didn’t use a great off-the-shelf malloc like tcmalloc. Now it’s my turn, and I also wrote my own. I guess you can accuse me of failing to learn from my mistakes; I’m certainly a hypocrite!

In my defense, a few things about our hypervisor make using someone else’s malloc a pain. First, we’re effectively an embedded system working inside a small chunk of memory allocated in a device driver. The hypervisor works in its own virtual address space, but we never modify that address space. We do some large allocations along the way, but we also have some long-lived smaller memory allocations, so fragmentation is a concern. We code our hypervisor in C. In addition to these constraints, we have one opportunity: we have no pre-emption in our threading package.

I decided to create a pretty simple global allocator, but with the addition of per-physical-processor freelists. Since there’s no pre-emption in our system, we can put freelists in local storage on physical processors, and access the freelists without a lock, without fear of migration.

An example of the global heap data structure follows:

[Figure: an example Jinx heap data structure.]

The heap itself is a linked list of regions from HEAD to TAIL. Each element on the linked list is either in-use or free, or is the head or tail of the list. The heap linked list may have no two adjacent free regions (they should be coalesced into one free region). In addition, there are freelist heads for powers of two. For a given freelist head with size S, every item of size P on that freelist obeys S <= P < S * 2.  At the head of every region, there’s a data structure that looks like this:

struct region_header {
    uint64 alloc_state : 2;
    uint64 bytes_to_previous_header : 31;
    uint64 bytes_to_next_header : 31;
    LIST_ENTRY(region_header) freelist;
};

That LIST_ENTRY is a FreeBSD macro that represents doubly-linked-list linkage.  alloc_state has one of these values:

enum { ALLOC_STATE_INVALID, ALLOC_STATE_FREE, ALLOC_STATE_ALLOCED,
       ALLOC_STATE_SENTINEL };

Sentinel is used for HEAD and TAIL.
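
To make the freelist invariant stated above concrete, here is a minimal sketch, with a hypothetical name and not the actual Jinx code, of how an allocation size might be mapped to its power-of-two freelist:

#include <stdint.h>

/* Hypothetical helper (my illustration, not the Jinx code): map a
 * region size to the power-of-two freelist whose head size S
 * satisfies S <= size < 2*S. */
static int freelist_index(uint64_t size)
{
    int idx = 0;
    while (idx + 1 < 64 && (UINT64_C(1) << (idx + 1)) <= size)
        idx++;
    return idx;   /* (1 << idx) <= size < (1 << (idx + 1)) */
}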

To free an item into the heap, something like this occurs:

free_item(item, heap) {
    item->alloc_state = ALLOC_STATE_FREE;
    coalesce(item, right_neighbor(item));   /* merge with a free right neighbor */
    coalesce(left_neighbor(item), item);    /* then with a free left neighbor */
}

This preserves the heap invariant that there exist no two adjacent free regions.
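
Coalescing itself might look something like the following sketch. It is my simplification, using a hypothetical next_header helper, and it elides the freelist bookkeeping of the real code:

#include <sys/queue.h>   /* FreeBSD LIST_* macros */

/* Headers locate their right-hand neighbors by byte offset. */
static struct region_header *next_header(struct region_header *h)
{
    return (struct region_header *)((char *)h + h->bytes_to_next_header);
}

/* Merge 'right' into 'left' when both are free, preserving the
 * invariant that no two adjacent regions are free.  Sentinels and
 * in-use regions make this a no-op. */
static void coalesce(struct region_header *left, struct region_header *right)
{
    if (left->alloc_state != ALLOC_STATE_FREE ||
        right->alloc_state != ALLOC_STATE_FREE)
        return;

    LIST_REMOVE(right, freelist);   /* 'right' ceases to exist as a region */
    left->bytes_to_next_header += right->bytes_to_next_header;
    next_header(left)->bytes_to_previous_header = left->bytes_to_next_header;
    /* ...after which 'left' is moved to the freelist for its new size. */
}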

We use one giant lock for all items in the heap, relying on per-processor freelists to provide concurrency.  Those freelists reuse the freelist field of the header structure for linkage.  Those fields are protected by the global heap lock only as long as alloc_state == ALLOC_STATE_FREE.

When I attempted to make this heap thread-safe, though, I wrote this:

free_item(item, heap) {
    item->alloc_state = ALLOC_STATE_FREE;   /* the store happens outside the lock! */
    lock();
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}

A bug in this code, like many concurrency errors, resulted in intermittent, unusual crashes and corruption.  It was rare enough that I almost checked the code in; I was saved only because one of the many unit tests we run prior to commit happened to crash once.

Now, in retrospect, setting the alloc_state field outside of the lock was a pretty obvious bug.  Imagine that a manipulator of our left-hand neighbor, holding the lock, observed that our region had become free, and coalesced its region with ours. That would be catastrophic, as we’d end up with a pointer to a region that no longer was!  Because regions consult their neighbors’ allocation status during coalescing, that field must be protected by that lock.
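
The fix is simply to move the store into the critical section, so that neighbors consulting our allocation state always do so under the lock:

free_item(item, heap) {
    lock();
    item->alloc_state = ALLOC_STATE_FREE;   /* now protected by the lock */
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}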

However, there’s a more subtle bug lurking here. Consider a different structure definition:

struct region_header {
    uint64 available: 2;
    uint64 alloc_state : 2;
    uint64 bytes_to_previous_header : 30;
    uint64 bytes_to_next_header : 30;
    LIST_ENTRY(region_header) freelist;
};

Let’s say the “available” field is reserved for use by the user of the malloc interface.  Now we have a problem with non-atomicity of modification. That is, even though that field is owned by the owner of the memory block, not by the holder of the lock, it cannot be modified using normal operations.

void free_item(item, heap) {
    item->available = 2;    /* a read-modify-write of the whole shared word */
    lock();
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}

The first part of that code, compiled into x86-64 instructions, might look like

and $~3, (%rax)
or $2, (%rax)

As covered in a recent blog post about non-serializable single instructions, variants of these instructions without the LOCK prefix are not atomic, so there’s ample opportunity here to clobber ongoing modifications to bitfields that share our storage.  My coworker David points out that it’s more likely to get compiled as this:

mov (%rax), %rdx
and $~3, %rdx
or $2, %rdx
mov %rdx, (%rax)

Either way, this is a non-atomic update!
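
If those “available” bits really had to change without taking the heap lock, the portable escape, which is not what we did, would be to make the whole 64-bit word atomic and update it with a compare-and-swap loop. A sketch:

#include <atomic>
#include <cstdint>

// Hypothetical illustration: atomically replace the two low-order
// bits of a word that other bitfields share.  The CAS loop retries
// until it installs its update without clobbering concurrent
// modifications to the neighboring bits.
void set_available(std::atomic<uint64_t> &word, uint64_t value)
{
    uint64_t old = word.load(std::memory_order_relaxed);
    uint64_t desired;
    do {
        desired = (old & ~UINT64_C(3)) | (value & UINT64_C(3));
    } while (!word.compare_exchange_weak(old, desired,
                                         std::memory_order_relaxed));
}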

This is yet another occasion where local reasoning about code is foiled by concurrency.  Unlike single-threaded code, you have to know the exact structure you’re dealing with to know if modifying available is legal. In addition, you need to know what sort of atomicity your compiler or language guarantees when you’re accessing variables. For example, would this be OK?

struct region_header {
    uint8 available;
    uint8 alloc_state;
    uint16 bytes_to_previous_header;
    uint16 bytes_to_next_header;
    LIST_ENTRY(region_header) freelist;
};

When I modify “available”, how do I know that “bytes_to_previous_header” doesn’t take a read-modify-write trip with me?  Can I use that field without acquiring the global heap lock?  I sent this question out to the rest of our development team and our UW CSE faculty: in what languages is a byte read or write guaranteed to be isolated from adjacent fields?  How about a 16-bit number?  How about numbers that are the natural register width?  How about larger ones?

Consensus was that all languages on all commonly used architectures guarantee that if one reads or writes a byte, adjacent fields don’t go along for the ride, and that this is true, outside of bitfields, for all types up to the register size of a CPU.  Programs would clearly be too hard to reason about were this not true.
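
For instance, in C++11 terms the following sketch is free of data races, because adjacent non-bitfield members are separate memory locations:

#include <cstdint>
#include <thread>

struct S {
    uint8_t a;   // its own memory location
    uint8_t b;   // its own memory location
};

int main()
{
    S s = {0, 0};
    // These writes are guaranteed not to interfere: storing to 'a'
    // may not read-modify-write the neighboring byte 'b'.
    std::thread t1([&s] { s.a = 1; });
    std::thread t2([&s] { s.b = 2; });
    t1.join();
    t2.join();
    return s.a + s.b;   // always 3
}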

Bartosz illustrated this to me for C++ with some language from the C++ spec:

(1.7.3) A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. […] Two threads of execution (1.10) can update and access separate memory locations without interfering with each other.
(1.7.4) [ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of non-zero width. —end note ]
(1.7.5) [ Example: A structure declared as

  struct {
    char a;
    int b:5,
    c:11,
    :0,
    d:8;
    struct {int ee:8;} e;
  }

contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]

I would be delighted to hear about it if people have examples from computers, compilers, and languages, old and new, where this is not true.

One last note on this topic:  In the FreeBSD kernel, the developers have a good habit of documenting what lock protects which field of a structure, via a locking guide.  In their world, that structure might look like this:

/*
 * Locking guide:
 * L:  this field is modified only under the global malloc lock.
 * L/F: When alloc_state == ALLOC_STATE_FREE, L. Else, unlocked field.
 */
struct region_header {
    uint64 alloc_state : 2;                   /* L/F */
    uint64 bytes_to_previous_header : 31;     /* L */
    uint64 bytes_to_next_header : 31;         /* L */
    LIST_ENTRY(region_header) freelist;       /* L/F */
};

In this world, the author of the locking guide would be reprimanded for trying to apply locking rule L/F to the field alloc_state or the hypothetical available field while other members of the same bitfield had locking rule L.  Any time you document who locks what in a structure, the granularity can be no finer than what your compiler and your architecture guarantee, assuming you’re using non-atomic operations to modify the fields!  This is a great practice that has served Corensic well in the development of its hypervisor.  My non-adherence to this practice cost me a few hours of debugging!

So, in summary:

  • The combination of bitfields + concurrency is dangerous, because fields of a structure are implicitly intertwined with adjacent fields.
  • Those problems could be mitigated somewhat by wrapping all bitfields in their own structures (see the sketch after this list).  If you don’t do this (as we don’t), then you are denied local reasoning about fields of structures.
  • For these reasons, converting code into use of bitfields due to space pressure is also an operation that cannot be performed with only local reasoning.  When you change a field of structure to a bitfield, you have to look at uses of that field to ensure that you don’t violate interference assumptions in the code.
  • Use of a locking guide in your C structures will alert you to many of these problems.
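
Here is what that mitigation might look like for our header. This is a sketch, not the Jinx layout; note that it trades space for independence, since each nested struct becomes its own memory location and is padded separately:

#include <stdint.h>
#include <sys/queue.h>

struct region_header {
    struct { uint64_t alloc_state : 2; } state;   /* L/F: own memory location */
    struct { uint64_t available : 2; } user;      /* owned by the block's owner */
    uint64_t bytes_to_previous_header : 31;       /* L */
    uint64_t bytes_to_next_header : 31;           /* L */
    LIST_ENTRY(region_header) freelist;           /* L/F */
};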

In a previous post to this blog, “Dealing with Benign Data Races the C++ Way”, Bartosz examined an apparently benign data race involving a recursively acquirable mutex. In his analysis, he suggested a highly unlikely, but legal, transformation that a compiler could apply to make the race in the mutex less-than-benign. Bartosz’s example transformation involved temporarily writing an arbitrary value (say 42) to a variable (the owner field) that was always guarded by a lock, except for a racy read during acquisition of the recursive mutex.

In his comment on Bartosz’s post, Hans Boehm described a more likely transformation, very much like one that happened to me once:

I think a more likely breaking “transformation” here is that the owner field may be misaligned or, for an embedded processor, larger than the atomically updateable word size. In either case, updates could become visible to observers in two sections. This means the observed value could be different from any value actually stored. In unlikely cases, it could match the thread id of the querying thread.

Compilers generally make you jump through hoops if you want to misalign fields inside structures, but it can be done. Through a sequence of terrible accidents, my coworkers and I once produced a piece of code that, while more obscure, was equivalent to the following:

#pragma pack (push, 1)  // May cause fields of RecursiveMutex
                        // not to be naturally aligned.

class RecursiveMutex {
public:
    RecursiveMutex() : owner(0), count(0), lock() {}
    bool TryEnter();
    void Exit();

private:
    char garbage;
    int owner;
    int count;
    Lock lock;
};

bool RecursiveMutex::TryEnter() {
    if (owner == /* get thread id */) {  // Racy read on this line.
        count += 1;
        return true;
    }

    if (lock.TryEnter()) {
        owner = /* get thread id */;
        return true;
    }
    return false;
}

void RecursiveMutex::Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 0;
    lock.Exit();
}

With the compiler we were using, the #pragma pack (push, 1) declaration caused the fields of RecursiveMutex to be laid out in memory without any padding that would allow for natural field alignment. As a result, the reads and writes of the “owner” field were not guaranteed to be atomic by our compiler and architecture (AMD64). Worse still, because the x86 and AMD64 architectures provide hardware support for unaligned memory accesses, this bug could manifest itself as a non-serializable execution of just two machine instructions.

For example, suppose the assignment “owner = 0” in Exit() compiled to the following x86/AMD64 instruction:

MOV DWORD PTR [ecx], 0

Further, suppose that the racy read of “owner” in TryEnter compiled to this instruction:

MOV eax, [ecx]

If the owner field is not naturally aligned, and the RecursiveMutex instance starts at address 0x1000, the following non-serializable interleaving of two threads is possible:

Seq#   Thread 1, calling Exit                  Thread 2, calling TryEnter
----   --------------------------------------  --------------------------------------
 1     Store the first 3 bytes of the
       constant 0 into [0x1001:0x1003]
 2                                             Read [0x1001:0x1003] into the
                                               first 3 bytes of eax
 3                                             Read [0x1004] into the last
                                               byte of eax
 4     Store the last byte of the
       constant 0 into [0x1004]

In this case, thread 2 will see a value that should never have been observable in memory: a partial read of the first thread’s write, combined with a partial read of the data present before the write. This value may be equal to the second thread’s ID. However unlikely that may be, the consequences of such a coincidence could be disastrous.

Interestingly, even though the atomic (LOCK-prefixed) forms of the x86/AMD64 read-modify-write instructions can atomically access data that is not naturally aligned, there is no atomic form of ordinary loads and stores, such as MOV, when the data is unaligned.

What is the lesson for a C++ programmer who, for one reason or another, wants to play with lock-free code? If you have access to a C++11 compiler, use atomics. The owner field should be declared as std::atomic<int>, even if you only access it through relaxed memory operations, such as load(memory_order_relaxed). An atomic variable is guaranteed to be naturally aligned. For those of us still using obsolete compilers, make sure you’re not forcing the compiler to misalign your shared data.
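
A cheap defense, whatever the compiler vintage, is a compile-time check that the shared field is naturally aligned. Here is a sketch in C++11 syntax, using a plain struct for illustration (on older compilers the same check can be written with the negative-size-array trick):

#include <cstddef>

struct MutexData {
    char garbage;
    int  owner;   // placed at offset 4 by default padding
};

// If someone later wraps this struct in #pragma pack(1), the build
// breaks here instead of the mutex breaking at run time.
static_assert(offsetof(MutexData, owner) % sizeof(int) == 0,
              "owner is not naturally aligned; its accesses may tear");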

Caveat

While the x86 and AMD64 architectures may treat non-atomically any unlocked memory operation that is not naturally aligned, implementations tend to show this behavior only at certain larger alignment boundaries.  For example, some processors treat all memory accesses that fall within a single 64-byte cache line atomically.  Others do so only for accesses that fall within a 16-byte store-buffer entry.  This makes the unsynchronized sharing of data that is not naturally aligned extremely dangerous: it may work on one machine, and not on another.  At Corensic, we have observed processors that split at 16-byte boundaries and others that split at 64-byte boundaries.


Data races lead to undefined behavior; but how bad can they really be? In my previous post I talked about benign data races and I gave several examples taken from the Windows kernel. Those examples worked because the kernel was compiled with a specific compiler for a specific processor. But in general, if you want your code to be portable, you can’t have data races, period.

You just cannot reason about something that is specifically defined to be “undefined.” So, obviously, you cannot prove the correctness of a program that has data races. However, very few people engage in proofs of correctness. In most cases the argument goes, “I can’t see how this can fail, therefore it must be correct” (maybe not in so many words). I call this “proof by lack of imagination.” If you want to become a concurrency expert, you have to constantly stretch your imagination. So let’s do a few stretches.

One of the readers of my previous post, Duarte Nunes, posted a very interesting example of a benign data race. Here’s the code:

int owner = 0;
int count = 0;
Lock lock;

bool TryEnter() {
    if (owner == /* get thread id */) {
        count += 1;
        return true;
    }

    if (lock.TryEnter()) {
        owner = /* get thread id */;
        return true;
    }
    return false;
}

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 0;
    lock.Exit();
}

Note which parts are executed under the lock (in a correct program, ‘Exit’ will always be called after the lock has been acquired). The variable ‘count’ is always accessed under the lock, so no data races there. However, the variable ‘owner’ may be read outside of the lock, in the first line of ‘TryEnter’. That’s the data race we are talking about.

Try to analyze this code and imagine what could go wrong. Notice that the compiler or the processor can’t be too malicious. The code still has to work correctly if the data race is removed, for instance if the racy read is put under the lock.

Here’s an attempt at the “proof” of correctness. First, Duarte observed that “The valid values for the ‘owner’ variable are zero and the id of any thread in the process.” That sort of makes sense, doesn’t it? Now, the only way the racy read can have any effect is if the value of ‘owner’ is equal to the current thread’s ID. But that’s exactly the value that could have been written only by the current thread, and only under the lock.

There are two possibilities when reading ‘owner’: either we are still under that lock, in which case the read is not at all racy; or we have already released the lock. But the release of the lock happens only after the current thread zeroes ‘owner.’

Notice that this is a single-thread analysis and, within a single thread, events must be ordered (no need to discuss memory fences or acquire/release semantics at this point). A read following a write in the same thread cannot see the value that was there before the write. That would break regular single-threaded programs. Of course, other threads may have overwritten this zero with their own thread IDs, but never with the current thread ID. Or so the story goes…

Brace yourself: Here’s an example of how a compiler or the processor may “optimize” the program:

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 42;
    owner = 0;
    lock.Exit();
}

You might argue that this is insane and no compiler in its right mind would do a thing like this; but the truth is: It’s a legal program transformation. The effect of this modification is definitely not observable in the current thread. Neither is it observable by other threads in the absence of data races. Now, the unfortunate thread whose ID just happens to be 42 might observe this value and take the wrong turn. But it can only observe it through a racy read. For instance, it would never see this value if it read ‘owner’ under the lock. Moreover, it would never see it if the variable ‘owner’ were defined as ‘atomic’:

std::atomic<int> owner(0);

Stores and loads of atomic variables are, by default, sequentially consistent. Unfortunately sequential consistency, even on an x86, requires memory fences, which can be quite costly. It would definitely be overkill in our example. So here’s the trick: Tell the compiler to forgo sequential consistency on a per-read/write basis. For instance, here’s how you read an atomic variable without imposing any ordering constraints:

owner.load(memory_order_relaxed)

Such ‘relaxed’ operations will not introduce any memory fences, at least not on any processor I know about.

Here’s the version of the code that is exactly equivalent to the original, except that it’s correct and portable:

std::atomic<int> owner(0);
int count = 0;
Lock lock;

bool TryEnter() {
    if (owner.load(memory_order_relaxed) == /* get thread id */) {
        count += 1;
        return true;
    }

    if (lock.TryEnter()) {
        owner.store(/* get thread id */, memory_order_relaxed);
        return true;
    }
    return false;
}

void Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner.store(0, memory_order_relaxed);
    lock.Exit();
}

So what has changed? Can’t the compiler still do the same dirty trick on us, and momentarily store 42 in the ‘owner’ variable? No, it can’t! Since the variable is declared ‘atomic,’ the compiler can no longer assume that the write can’t be observed by other threads.

The new version has no data races: The (draft) Standard specifically states that ‘atomic’ variables don’t contribute to data races, even in their most relaxed form.

C++ Draft Standard, (1.10.5):
[…] “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.

With those changes, I believe that our “proof” of correctness may actually be turned into a more rigorous proof using the axioms of the C++ memory model (although I’m not prepared to present one). We can have our cake (correct, portable code) and eat it too (no loss of performance). And, by the way, the same trick may be used in the case of lossy counters from my previous post.
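
For the record, here is my sketch of that counter trick. The relaxed load and store keep the counter deliberately lossy and cheap, but there is no longer a data race in the formal sense:

#include <atomic>

// A deliberately lossy event counter shared by many threads.
std::atomic<long> event_count(0);

void count_event()
{
    // Concurrent increments may still be lost (acceptable here),
    // but the program's behavior is now well defined.
    long n = event_count.load(std::memory_order_relaxed);
    event_count.store(n + 1, std::memory_order_relaxed);
}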

Warning: I do not recommend this style of coding, or the use of weak atomics, to anybody who is not implementing operating system kernels or concurrency libraries.

Acknowledgments

I’d like to thank Luis Ceze, Hans Boehm, and Anthony Williams for useful remarks and for verifying my assumptions about the C++ memory model.

Bibliography

  1. C++ Draft Standard

Announcement

In the effort to raise the awareness about concurrent programming, I’ll be hosting a free webinar, “The Language of Concurrency,” Tuesday May 24, 2011 at 12pm PDT. It will be a broad overview of the domain and the explanation of common terms, such as message passing, data races, atomics, sequential consistency, etc. Nothing sophisticated, but you might recommend it to your coworkers who are still learning the ropes or need a refresher. You can pre-register here, get an email reminder, and download the slides. If this is successful, I’ll start working on some more advanced topics (suggestions are welcome).

Can a data race not be a bug? In the strictest sense I would say it’s always a bug. A correct program written in a high-level language should run the same way on every processor present, past, and future. But there is no proscription, or even a convention, about what a processor should (or shouldn’t) do when it encounters a race. This is usually described in higher-level language specs by the ominous phrase: “undefined behavior.” A data race could legitimately reprogram your BIOS, wipe out your disk, and stop the processor’s fan causing a multi-core meltdown.

Data race: Multiple threads accessing the same memory location without intervening synchronization, with at least one thread performing a write.

However, if your program is only designed to run on a particular family of processor, say the x86, you might allow certain types of data races for the sake of performance. And as your program matures, i.e., goes through many cycles of testing and debugging, the proportion of buggy races to benign races keeps decreasing. This becomes a real problem if you are using a data-race detection tool that cannot distinguish between the two. You get swamped by false positives.

Microsoft Research encountered and dealt with this problem when running their race detector called DataCollider on the Windows kernel (see Bibliography). Their program found 25 actual bugs, and almost an order of magnitude more benign data races. I’ll summarize their methodology and discuss their findings about benign data races.

Data Races in Windows Kernel

The idea of the program is very simple. Put a hardware breakpoint on a shared memory access and wait for one of the threads to stumble upon it. This is a code breakpoint, which is triggered when the particular code location is executed. The x86 also supports data breakpoints, which are triggered when the program accesses a specific memory location. So when a thread hits the code breakpoint, DataCollider installs a data breakpoint on the location the thread was just accessing. It then stalls the current thread and lets all other threads run. If any one of them hits the data breakpoint, it’s a race (as long as one of the accesses is a write). Consider this: If there were any synchronization between the two accesses, the second thread would have been blocked from accessing that location. Since it wasn’t, we have a classic data race.
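
In outline, the sampling step might look like the sketch below. The helper names are hypothetical; the real tool drives the x86 debug registers and kernel trap handlers:

#include <cstddef>

// Hypothetical helpers standing in for debug-register machinery.
void set_data_breakpoint(void *addr, size_t size, bool trap_writes_only);
bool data_breakpoint_fired();   // did another thread trip it?
void clear_data_breakpoint();
void stall_current_thread();
void report_race(void *addr);

// Called when a randomly chosen code breakpoint fires on an
// instruction that is about to access [addr, addr + size).
void on_code_breakpoint(void *addr, size_t size, bool this_access_writes)
{
    // If our access is a read, only a conflicting write is a race,
    // so trap writes only; if we write, any conflicting access races.
    set_data_breakpoint(addr, size, !this_access_writes);
    stall_current_thread();      // give the other threads time to collide
    if (data_breakpoint_fired())
        report_race(addr);       // no synchronization intervened
    clear_data_breakpoint();
}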

Notice that this method might not catch all data races, but it doesn’t produce false positives. Except, of course, when the race is considered benign.

There are other interesting details of the algorithm. One is the choice of code locations for installing breakpoints. DataCollider first analyzes the program’s assembly code to create a pool of memory accesses. It discards all thread-local accesses and explicitly synchronized instructions (for instance, the ones with the LOCK prefix). It then randomly picks locations for breakpoints from this pool. Notice that rarely executed paths are as likely to be sampled as the frequently executed ones. This is important because data races often hide in less frequented places.

Pruning Benign Races

90% of the data races caught by DataCollider in the Windows kernel were benign. For several reasons it’s hard to say how general this result is. First, the kernel had already been tested and debugged for some time, so many low-hanging concurrency bugs had already been picked. Second, operating system kernels are highly optimized for a particular processor and might use all kinds of tricks to improve performance. Finally, kernels often use unusual synchronization strategies. Still, it’s interesting to see what shape benign data races take.

It turns out that half of the false positives came from lossy counters. There are many places where statistics are gathered: counting certain kinds of events, either for reporting or for performance enhancements. In those situations losing a few increments is of no relevance. However, not all counters are lossy and, for instance, a data race in reference counting is a serious bug. DataCollider uses a simple heuristic to detect lossy counters: they are the ones that are always incremented. A reference counter, on the other hand, is as often incremented as decremented.

Another benign race happens when one thread reads a particular bit in a bitfield while another thread updates another bit. A bit update is a read-modify-write (RMW) sequence: The thread reads the previous value of the bitfield, modifies one bit, and writes the whole bitfield back. The other bits are overwritten in the process too, but their new values are the same as the old values. A read from another thread of any of the non-changed bits does not interfere with the write, at least not on the x86. Of course, if yet another thread modified one of those bits, it would be a real bug, and it would be caught separately. The pruning of this type of race requires analysis of the surrounding code (looking for the masking of other bits).

The Windows kernel also has some special variables that are racy by design; the current time is one such example. DataCollider has these locations hard-coded and automatically prunes them away.

There are benign races that are hard to prune automatically, and those are left for manual pruning (in fact, DataCollider reports all races, it just de-emphasizes the ones it considers benign). One of them is the double-checked locking pattern (DCLP), where a thread makes a non-synchronized read to be later re-confirmed under the lock. This pattern happens to work on the x86, although it definitely isn’t portable.
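
In code, DCLP looks something like this sketch, with a minimal assumed Lock interface:

struct Widget {};                             // stand-in payload
struct Lock { void Enter(); void Exit(); };   // assumed interface

Widget *g_widget = 0;   // lazily created shared object
Lock g_lock;

Widget *get_widget()
{
    if (g_widget == 0) {         // racy read, no lock held
        g_lock.Enter();
        if (g_widget == 0)       // re-confirmed under the lock
            g_widget = new Widget;
        g_lock.Exit();
    }
    return g_widget;
}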

Finally, there is the interesting case of idempotent writes: two racing writes that happen to store the same value to the same location. Even though such scenarios are easy to prune, the implementers of DataCollider decided not to prune them, because more often than not they led to the uncovering of concurrency bugs. The table below summarizes the various cases.

Benign race                         Differentiate from                  Pruned?
----------------------------------  ----------------------------------  -------
Lossy counter                       Reference counting                  Yes
Read and write of different bits    Read and write of the whole word    Yes
Deliberately racy variables                                             Yes
DCLP                                                                    No
Idempotent writes                                                       No

Conclusion

In the ideal world there would be no data races. But a concurrency bug detector must take into account the existence of benign data races. In the early stages of product testing the majority of detected races are real bugs. It’s only when chasing the most elusive of concurrency bugs that it becomes important to weed out benign races. But it’s the elusive ones that bite the hardest.

Bibliography

  1. John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk, Effective Data-Race Detection for the Kernel
