In a previous post to this blog, “Dealing with Benign Data Races the C++ Way”, Bartosz examined an apparently benign data race involving a recursively acquirable mutex. In his analysis, he suggested a highly unlikely, but legal, transformation that a compiler could apply to make the race in the mutex less-than-benign. Bartosz’s example transformation involved temporarily writing an arbitrary value (say 42) to a variable (the owner field) that was always guarded by a lock, except for a racy read during acquisition of the recursive mutex.
In his comment on Bartosz’s post, Hans Boehm described a more likely transformation that is very much like one that happened to me, once:
> I think a more likely breaking “transformation” here is that the owner field may be misaligned or, for an embedded processor, larger than the atomically updateable word size. In either case, updates could become visible to observers in two sections. This means the observed value could be different from any value actually stored. In unlikely cases, it could match the thread id of the querying thread.
Compilers generally make you jump through hoops if you want to misalign fields inside structures, but it can be done. Through a sequence of terrible accidents, my coworkers and I once produced a piece of code that, while more obscure, was equivalent to the following:
```cpp
#pragma pack (push, 1)  // May cause fields of RecursiveMutex
                        // not to be naturally aligned.
class RecursiveMutex {
public:
    RecursiveMutex() : owner(0), count(0), lock() {}
    bool TryEnter();
    void Exit();

private:
    char garbage;
    int owner;
    int count;
    Lock lock;
};

bool RecursiveMutex::TryEnter() {
    if (owner == /* get thread id */) {  // Racy read on this line.
        count += 1;
        return true;
    }
    if (lock.TryEnter()) {
        owner = /* get thread id */;
        return true;
    }
    return false;
}

void RecursiveMutex::Exit() {
    if (count != 0) {
        count -= 1;
        return;
    }
    owner = 0;
    lock.Exit();
}
```
With the compiler we were using, the #pragma pack (push, 1) directive caused the fields of RecursiveMutex to be laid out in memory without any padding that would allow for natural field alignment. As a result, the reads and writes of the “owner” field were not guaranteed to be atomic by our compiler and architecture (AMD64). Worse still, because the x86 and AMD64 architectures provide hardware support for unaligned memory accesses, this bug could manifest itself as a non-serializable execution of just two machine instructions.
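As a quick way to confirm what the pragma did, here is a small sketch written for this post rather than taken from our code base; PackedLayout is a stand-in for RecursiveMutex’s data members (the Lock member is omitted), and it simply prints where owner lands:

```cpp
#include <cstddef>   // offsetof
#include <cstdio>

#pragma pack(push, 1)
struct PackedLayout {    // stand-in for RecursiveMutex's data members
    char garbage;
    int  owner;
    int  count;
};
#pragma pack(pop)

int main() {
    // With pack(1) this typically prints "owner at offset 1, alignof(int) = 4":
    // owner is not naturally aligned, so plain 4-byte loads and stores of it
    // are not guaranteed to be atomic.
    std::printf("owner at offset %zu, alignof(int) = %zu\n",
                offsetof(PackedLayout, owner), alignof(int));
    return 0;
}
```

With the field misaligned like this, whether an individual load or store is atomic depends entirely on how the hardware executes it.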
For example, suppose the assignment “owner = 0” in Exit() compiled to the following x86/AMD64 instruction:
```asm
MOV [ecx], 0
```
Further, suppose that the racy read of “owner” in TryEnter compiled to this instruction:
```asm
MOV eax, [ecx]
```
If the owner field is not naturally aligned, and the RecursiveMutex instance starts at address 0x1000, the owner field occupies bytes 0x1001 through 0x1004, and the following non-serializable interleaving of two threads is possible:
| Seq# | Thread 1 calling Exit | Thread 2 calling TryEnter |
|------|-----------------------|---------------------------|
| 1 | Store the first 3 bytes of the constant 0 into [0x1001:0x1003] | |
| 2 | | Read [0x1001:0x1003] into the first 3 bytes of eax |
| 3 | | Read [0x1004] into the last byte of eax |
| 4 | Store the last byte of the constant 0 into [0x1004] | |
In this case, thread 2 will see a value that should never have been observable in memory: a partial read of the first thread’s write, combined with a partial read of the data that was present before the write. On a little-endian machine, for example, if the previous owner’s thread ID had been 0x12345678, thread 2 would assemble the value 0x12000000: three zero bytes from the new store and one stale byte from the old ID. This value may be equal to the second thread’s ID. However unlikely that may be, the consequences of such a coincidence could be disastrous.
Interestingly, even though the LOCK-prefixed (atomic) forms of the x86/AMD64 read-modify-write instructions can atomically access data that is not naturally aligned, there is no atomic form of a plain load or store, such as MOV, for unaligned data.
What is the lesson for a C++ programmer who, for one reason or another, wants to play with lock-free code? If you have access to a C++11 compiler, use atomics. The owner field should be defined as atomic, even if you access it through relaxed memory operations, such as load(memory_order_relaxed). An atomic variable is guaranteed to be naturally aligned. For those of us still using older, pre-C++11 compilers, make sure you’re not forcing the compiler to misalign your shared data.
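For illustration, here is a minimal C++11 sketch of that advice applied to the example above. It is not the original Corensic code: the Lock class and the ThisThreadId() helper are stand-ins invented for this sketch, and the memory orderings are chosen only to make the alignment point, not as a full correctness analysis of the recursive mutex.

```cpp
#include <atomic>

// Stand-in for the Lock class used in the example above (not the original).
struct Lock {
    std::atomic<bool> held{false};
    bool TryEnter() { return !held.exchange(true, std::memory_order_acquire); }
    void Exit()     { held.store(false, std::memory_order_release); }
};

// Hypothetical helper: returns a unique, non-zero id for the calling thread.
inline int ThisThreadId() {
    static std::atomic<int> next_id{1};
    thread_local int id = next_id.fetch_add(1, std::memory_order_relaxed);
    return id;
}

class RecursiveMutex {
public:
    bool TryEnter() {
        // Still a racy read, but now a relaxed atomic load: it may observe a
        // stale owner, yet it can never observe a torn one, because the
        // implementation aligns std::atomic<int> suitably.
        if (owner.load(std::memory_order_relaxed) == ThisThreadId()) {
            count += 1;
            return true;
        }
        if (lock.TryEnter()) {
            owner.store(ThisThreadId(), std::memory_order_relaxed);
            return true;
        }
        return false;
    }

    void Exit() {
        if (count != 0) {
            count -= 1;
            return;
        }
        owner.store(0, std::memory_order_relaxed);
        lock.Exit();
    }

private:
    std::atomic<int> owner{0};
    int count = 0;
    Lock lock;
};
```

The only change that matters here is that owner can no longer tear; the rest of the logic is unchanged from the example above.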
Caveat
While the x86 and AMD64 architectures may treat any unlocked memory operation that is not naturally aligned as non-atomic, implementations tend to split accesses only at certain larger alignment boundaries. For example, some processors handle all memory accesses that fall within a single 64-byte cache line atomically. Others do so only for accesses that fall within a single 16-byte store-buffer entry. This makes the unsynchronized sharing of data that is not naturally aligned extremely dangerous: it may work on one machine and not on another. At Corensic, we have observed processors that split at 16-byte boundaries and others that split at 64-byte boundaries.
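To see why the boundary matters, here is a deliberately racy demonstration written for this post rather than taken from it: it places a 32-bit value so that it straddles a 64-byte boundary and shares it between two threads with no synchronization. It knowingly breaks the alignment and data-race rules to reproduce the hazard; on a processor that still handles this access atomically it may well print nothing, which is exactly the point of the caveat above.

```cpp
#include <cstdint>
#include <cstdio>
#include <thread>

int main() {
    // A 64-byte-aligned buffer; the 32-bit value at offset 62 straddles the
    // boundary between the first and second 64-byte lines.
    alignas(64) static char buffer[128] = {};
    volatile uint32_t* shared =
        reinterpret_cast<volatile uint32_t*>(buffer + 62);

    std::thread writer([shared] {
        // Alternate between two bit patterns whose bytes never mix in any
        // serializable execution.
        for (long i = 0; i < 100000000L; ++i)
            *shared = (i & 1) ? 0xFFFFFFFFu : 0u;
    });

    std::thread reader([shared] {
        for (long i = 0; i < 100000000L; ++i) {
            uint32_t v = *shared;
            if (v != 0u && v != 0xFFFFFFFFu) {  // bytes from two different stores
                std::printf("torn value observed: 0x%08lX\n",
                            static_cast<unsigned long>(v));
                break;
            }
        }
    });

    writer.join();
    reader.join();
    return 0;
}
```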
Bibliography
- Scott Owens, Susmit Sarkar, Peter Sewell, “A Better x86 Memory Model: x86-TSO”
- Relaxed-Memory Concurrency (includes papers on non-x86 architectures)
May 23, 2011 at 4:41 pm
I should add that Andy implemented the detection of this kind of data race in Jinx.
May 24, 2011 at 11:12 am
In my mini-quest to help people who write C code (which doesn’t have standards support for threading), I wanted to point out a couple tricks that you can do with MSVC/gcc to avoid issues with pragma pack misaligning variables that you intend to use in benign races.
```c
// msvc.h:
#define COMPILER_DATA_ALIGN(A) __declspec(align(A))

// gcc.h:
#define COMPILER_DATA_ALIGN(A) __attribute__((aligned (A)))

typedef struct COMPILER_DATA_ALIGN(sizeof(uint32_t)) {
    uint32_t value;
} atomic_uint32_t;
```
The COMPILER_DATA_ALIGN prevents packing even inside a pragma pack region.
Now, by accessing this value via load/store functions like the ones in my previous comments (inline asm), you can have C code with the sequential-consistency guarantees provided by C++0x atomics.
Yes, it’s not portable in the sense that you, the programmer, need to write custom implementations of all these functions (rather than the C++0x library provider doing it). But it is portable in the sense that one person in your organization needs to code them once and the rest of your code is unchanged.
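The load/store functions referred to above aren’t reproduced in this thread. As a rough, hypothetical sketch of what such helpers might look like with GCC-style inline asm on x86/AMD64 (the names atomic_store_u32/atomic_load_u32 and the exact asm are illustrative, not the commenter’s): a sequentially consistent store can be expressed with XCHG, whose implicit LOCK prefix makes it a full barrier, and a sequentially consistent load of an aligned 32-bit value is a plain MOV.

```c
#include <stdint.h>

/* gcc.h version of the macro from the comment above. */
#define COMPILER_DATA_ALIGN(A) __attribute__((aligned (A)))

typedef struct COMPILER_DATA_ALIGN(sizeof(uint32_t)) {
    uint32_t value;
} atomic_uint32_t;

/* Hypothetical seq_cst store: XCHG with a memory operand is implicitly
   LOCKed, so it both stores atomically and acts as a full fence. */
static inline void atomic_store_u32(atomic_uint32_t *a, uint32_t v)
{
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(v), "+m"(a->value)
                         :
                         : "memory");
}

/* Hypothetical seq_cst load: an aligned 32-bit MOV is atomic on x86/AMD64;
   the "memory" clobber keeps the compiler from caching or reordering it. */
static inline uint32_t atomic_load_u32(atomic_uint32_t *a)
{
    uint32_t v;
    __asm__ __volatile__("movl %1, %0"
                         : "=r"(v)
                         : "m"(a->value)
                         : "memory");
    return v;
}
```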
May 24, 2011 at 11:17 am
The danger with this is that you must put the COMPILER_DATA_ALIGN(A) declaration _after_ the struct keyword, not in between the typedef and struct keywords, IIRC. If you place it in between, there are circumstances in which the alignment request will not apply. This is actually how I ended up with the packed lock structure that I simplified into this example.
May 26, 2011 at 1:15 pm
[…] covered in a recent blog post about non-serializable single instructions, variants of these instructions without the LOCK prefix are not atomic, so there’s ample […]
June 1, 2011 at 9:25 am
[…] Hans and he gave me a peek at this paper. Essentially, things are as bad or even worse than what Andy and I described in this blog. In particular Hans gave an example that showed how a redundant racy […]
June 7, 2011 at 10:18 am
[…] What does it all mean for the C++11 programmer? It means that there no longer is an excuse for data races. If you need benign data races for performance, rewrite your code using weak atomics. Weak atomics give you the same kind of performance as benign data races but they have well defined semantics. Traditional “benign” races are likely to be broken by optimizing compilers or on tricky architectures. But if you use weak atomics, the compiler will apply whatever means necessary to enforce the correct semantics, and your program will always execute correctly. It will even naturally align atomic variables to avoid torn reads and writes. […]
June 13, 2011 at 8:34 am
[…] even some write operations that update multiple bytes are not guaranteed to be atomic (see Andy’s blog post). For example, if a write operation to multiple bytes happen to cross a cache line boundary, the […]
July 8, 2011 at 10:13 pm
Regarding this remark about C++0x/C++11,
> An atomic variable is guaranteed to be naturally aligned.
can you please describe where this is guaranteed? Chapter 29 of the FDIS doesn’t contain the string “align”, at least not that my copy of Acrobat can find, and it’s not apparent where else to look for this guarantee.
Thanks,
Scott
July 9, 2011 at 11:49 am
Hi Scott. You are right, the Standard doesn’t guarantee the alignment of atomic variables and it probably shouldn’t. This detail is left to the implementation. And it’s probably possible to have a standard-compliant implementation that doesn’t enforce alignment but instead uses a lock for every atomic access.
In practice though, on most architectures, the requirement of atomicity either forces proper alignment, or leads to very inefficient code. And the lesson for the C++ programmer to use atomics still holds.
July 9, 2011 at 12:30 pm
Speaking of using a lock for every access, FDIS 29.4/1 says:
> The ATOMIC_…_LOCK_FREE macros indicate the lock-free property of the corresponding atomic types, with the signed and unsigned variants grouped together. The properties also apply to the corresponding (partial) specializations of the atomic template. A value of 0 indicates that the types are never lock-free. A value of 1 indicates that the types are sometimes lock-free. A value of 2 indicates that the types are always lock-free.
Do you happen to know the answers to the following questions?
(1) What is the difference between “atomic types” and “(partial) specializations of the atomic template”?
(2) Under what conditions would it make sense for an atomic type to be “sometimes” lock-free?
Thanks,
Scott
July 10, 2011 at 10:23 am
I could speculate about it, but there are people who I’m sure know the answers. I’ll ask Anthony Williams to respond. He wrote a book about this stuff.
July 10, 2011 at 11:37 pm
Regarding Scott’s questions:
(1) atomic_int is not necessarily the same type as atomic<int> (it could be a base class). The “(partial)” bit refers to the fact that atomic<T*> is a partial specialization. There is not a corresponding atomic_XXX type for pointers.
(2) The “sometimes lock-free” option is given to allow the property to depend on the details of the system the code is actually run on, rather than being statically determined. For example, on x86-64 platforms you may or may not have a CMPXCHG16B instruction for performing 128-bit atomic operations. If you don’t have it then you must use a lock, but if you do then your code can be lock-free, and this must be tested at runtime using CPUID. In this instance there isn’t a corresponding ATOMIC_XXX_LOCK_FREE macro, but on other platforms something similar might be the case for types which do have corresponding macros.
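To illustrate the run-time nature of that check, here is a small sketch (hypothetical; the Pair type is made up, and with GCC/libstdc++ linking against libatomic may be required). It simply asks whether a 16-byte atomic object is lock-free on the machine it runs on.

```cpp
#include <atomic>
#include <cstdio>

// A 16-byte trivially copyable payload on a 64-bit platform.
struct Pair {
    void* a;
    void* b;
};

int main() {
    std::atomic<Pair> p(Pair{nullptr, nullptr});
    // The answer depends on the processor (e.g. CMPXCHG16B support on x86-64)
    // and on how the standard library was built, so it can only be known here,
    // at run time.
    std::printf("16-byte atomic is %slock-free on this machine\n",
                p.is_lock_free() ? "" : "not ");
    return 0;
}
```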
August 15, 2011 at 9:36 am
[…] few more details: Several x86 instructions are modeled as a series of actions (see our earlier blog entry). In particular, a LOCK’d instruction can be split into smaller […]
August 25, 2011 at 6:34 am
Not sure what we mean by unlikely in “However unlikely that may be” — at least in the scenario mentioned, you’ve got a 1 in 256 chance of the new value and the original value being the same.
August 25, 2011 at 10:33 am
@Bill: That’s if you’re looking only at the last byte of the second thread ID. But the first three bytes must be zero for the bug to occur. Notice that the previous thread ID stored at that location must have had at least one non-zero bit in those three bytes, otherwise we’d have two different threads with the same ID. So the probability really depends on how many significant bits there are in a thread ID.