An acquaintance recently observed that his company was concerned more with correctness bugs than with crashing bugs. For them, correctness bugs are more painful to debug than crashing bugs and they also get worse when you make the leap to shared-memory parallel programming (of which threaded programming is one instance). This is a pretty typical story, and it’s worth considering why correctness bugs are so problematic. Of particular interest is why shared-memory parallel programming makes them so much worse.

Corruption is worse than crashing

If a program stops and complains, that’s one thing: an automated system or a human can just start it again. There might be real-time or reliability goals for the program that aren’t being met, but at least you know something went wrong. If it keeps on going, but corrupts the output in some way, there’s a much bigger problem. Most software is based on an implicit assumption of the fail-stop model. This just means that errors in the code stop the program’s forward progress before erroneous data can become nonvolatile: written to disk or communicated to another process. You can see examples of applications that make this assumption all the time, when they trashed their registry entries or made their tables inconsistent, and consequently can’t run any more. At least one study finds that 7% of bugs aren’t fail-stop bugs… that is they produce bad outputs rather than crashing or stopping.

Correctness bugs are hard to fix, hard to detect, and costly

In my acquaintance’s domain, correctness bugs result in subtly wrong graphics. There, the fault tends to get detected, but the error-resolution process is lengthy, as much more of the codebase is suspect. These bugs are detected later in the development process, because earlier stages lack sufficient verification to detect corruption, but crashes are always obvious. In other domains, correctness bugs could mean incorrect chances of success when drilling a test well, an incorrect diagnosis, a wrongly sequenced chromosome, a corrupted database, poor search results, or an ill-advised financial trade. In some of the worst cases, humans or machines take the data out of computation and start committing real-world resources to the results of the computation. Things get really expensive when errors in computation get into nonvolatile storage inside people’s brains.

Some of these bugs are eventually detected, as when you compute an incorrect launch window for a rocket. Others, like choosing not to drill a test well in a fruitful spot, will likely never be detected. The hidden and obvious economic drags of correctness bugs include unhappy customers, missed opportunities, and wasted efforts.

Correctness bugs are more horrible as your software gets more non-deterministic

These bugs get more insidious when you make the leap to shared-memory parallel programming. Whereas in the single-threaded, synchronous world, a given input to a program always yielded the same output, you’re now faced with programs that may give a correct output only a certain percentage of the time. You expect 100%, but it’s not generally possible to prove that a shared-memory parallel program produces deterministic results. This causes trouble for crashing bugs because it makes repros rarer, and allows bugs to creep further through the development process before detection. It also renders debuggers less useful… You can’t just step through until you hit the bug, because sometimes the bug happens, and sometimes it doesn’t, and often the debugger stops it from happening.

The problem is much worse in the case of correctness bugs: both correctness and non-deterministic bugs tend to get detected late in development, and combined, they appear very late. Users of your code can act upon the results because they’re not crashing bugs, and the bug can slip through your test because it’s non-deterministic, even if you have a test case that could have caught the bug! Input coverage isn’t enough any more. How do we deal with this problem?

Dealing with the Problem

Rough classification of programming bugs. Adding assertions to your program can turn correctness bugs into crashing bugs. Limiting shared mutable state reduces non-determinism.

The forces of non-determinism and corruption!

There are three big parts to the answer to this problem. They don’t make the problem go away. They just stop your life from becoming a hell of customer escalations and expensive round-trips to QA.

Choose a simple parallel programming model

First, the way you use multiple cores will change the amount of non-determinism in your software. All the parts of your program that can avoid shared-mutable memory should probably avoid it. For most applications, you’ll never eliminate non-determinism altogether. After all, what is input if not an access of shared-mutable state? But, you can try to guarantee that like inputs yield like outputs by being careful about the sorts of shared-memory parallel programming you do.

Turn your correctness bugs into crashing bugs

Next, increase the likelihood of your bugs being crashing bugs, rather than correctness bugs. The trusty assert() is your friend here. Ship your software with your asserts enabled, and don’t tolerate failing asserts. They tell you where the bug was, and what it was. When you get that bug report from the field, you know exactly what you need to fix. Programs that need to continue to function in the face of failure can use software fault-tolerance approaches like running every computation twice and comparing results, with an optional tie-breaker. Some problems are intrinsically easy to verify, like factoring an integer. Most are not. During test you can use late-stage output verification to achieve fail-stop. Make sure everything does this. By doing this, you’ll change more bugs into being crashing bugs, rather than correctness bugs.

Amplify your testing

Lastly, amplify your non-deterministic bug-find rate. If you expect to deploy software that will run at a rate of tens of thousands of machine-hours per hour, that is, you have tens of thousands of users, or a few users using thousands of computers, you won’t match their ability to find non-deterministic bugs in software without amplifying your bug-find rate. This is a potential success disaster: shipping lots of software will make you realize how buggy it is.

Our goal with Jinx is to let the developer use one computer to test as thoroughly as thousands would, and to let QA engineers make one test pass count for a thousand. To learn more about what we’re doing with Jinx, check out our website!