In a previous post to this blog, “Dealing with Benign Data Races the C++ Way”, Bartosz examined an apparently benign data race involving a recursively acquirable mutex. In his analysis, he suggested a highly unlikely, but legal, transformation that a compiler could apply to make the race in the mutex less-than-benign. Bartosz’s example transformation involved temporarily writing an arbitrary value (say 42) to a variable (the owner field) that was always guarded by a lock, except for a racy read during acquisition of the recursive mutex.

In his comment on Bartosz’s post, Hans Boehm described a more likely transformation that is very much like one that happened to me, once:

I think a more likely breaking “transformation” here is that the owner field may be misaligned or, for an embedded processor, larger than the atomically updateable word size. In either case, updates could become visible to observers in two sections. This means the observed value could be different from any value actually stored. In unlikely cases, it could match the thread id of the querying thread.

Compilers generally make you jump through hoops if you want to misalign fields inside structures, but it can be done. Through a sequence of terrible accidents, my coworkers and I once produced a piece of code that, while more obscure, was equivalent to the following:

#pragma pack (push, 1)  // May cause fields of RecursiveMutex
                        // not to be naturally aligned.

class RecursiveMutex {
public:
    RecursiveMutex() : owner(0), count(0), lock() {}
    bool TryEnter();
    void Exit();

private:
    char garbage;
    int owner;
    int count;
    Lock lock;
};

#pragma pack (pop)

bool RecursiveMutex::TryEnter() {
    if (owner == /* get thread id */) {  // Racy read on this line.
        count += 1;
        return true;
    }
    if (lock.TryEnter()) {
        owner = /* get thread id */;
        return true;
    }
    return false;
}

void RecursiveMutex::Exit() {
    if (count != 0) {
        count -= 1;
    } else {
        owner = 0;
        lock.Exit();
    }
}

With the compiler we were using, the #pragma pack (push, 1) directive caused the fields of RecursiveMutex to be laid out in memory without any padding that would allow for natural field alignment. As a result, the reads and writes of the “owner” field were not guaranteed to be atomic by our compiler and architecture (AMD64). Worse still, because the x86 and AMD64 architectures provide hardware support for unaligned memory accesses, this bug could manifest itself as a non-serializable execution of just two machine instructions.

For example, suppose the assignment “owner = 0” in Exit() compiled to the following x86/AMD64 instruction:

MOV DWORD PTR [ecx], 0

Further, suppose that the racy read of “owner” in TryEnter compiled to this instruction:

MOV eax, [ecx]

If the owner field is not naturally aligned, and the RecursiveMutex instance starts at address 0x1000, the following non-serializable interleaving of two threads is possible:

Seq#  Thread 1 calling Exit                      Thread 2 calling TryEnter
1     Store the first 3 bytes of the
      constant 0 into [0x1001:0x1003]
2                                                Read [0x1001:0x1003] into the
                                                 first 3 bytes of eax
3                                                Read [0x1004] into the last
                                                 byte of eax
4     Store the last byte of the
      constant 0 into [0x1004]

In this case, thread 2 will see a value that should never have been observable in memory: a partial read of the first thread’s write, combined with a partial read of the data present before the write. This value may be equal to the second thread’s ID. However unlikely that may be, the consequences of such a coincidence could be disastrous.

Interestingly, even though the LOCK-prefixed forms of the x86/AMD64 read-modify-write instructions can atomically access data that is not naturally aligned, plain loads and stores, such as MOV, have no atomic form for unaligned data.

What is the lesson for a C++ programmer who, for one reason or another, wants to play with lock-free code? If you have access to a C++11 compiler, use atomics. The owner field should be declared as std::atomic, even if you access it only through relaxed memory operations, such as load(memory_order_relaxed). An atomic variable is guaranteed to be naturally aligned. For those of us still using older compilers, make sure you’re not forcing the compiler to misalign your shared data.


While the x86 and AMD64 architectures are permitted to treat any unlocked memory operation that is not naturally aligned non-atomically, implementations tend to split accesses only at certain larger alignment boundaries. For example, some processors treat all memory accesses that fall within a single 64-byte cache line as atomic; others do so only for accesses that fall within a single 16-byte store-buffer entry. This makes the unsynchronized sharing of data that is not naturally aligned extremely dangerous: it may work on one machine, and not on another. At Corensic, we have observed processors that split at 16-byte boundaries and at 64-byte boundaries.

