The Cost of Debugging Software

Have you ever spent a day tracking down a program error, only to find it was something trivial deep inside the code? It takes a surprising amount of time to track down this sort of error. In theory bugs in low-level code get found and fixed quickly as development proceeds, but in practice a few get left behind.

Programmers debug programs by interrupting execution and examining the current state of the program. Modern processors execute millions if not billions of instructions per second, so it is not practical to watch the state of every one. As a result, "single stepping" through a program really means single-stepping through lines of code, skipping over vast numbers of instructions.

Why is this so? Any significant program is split into many different units - classes and other data structures, functions and procedures, modules and subsystems. Often a bug first appears after a developer makes a change to a portion of that program, and it is natural to expect that the error is in the newly changed code. Thus the developer will not step into function calls and object methods that are not part of the current module. This allows a quick evaluation of the code most likely to contain the bug.

Even if the bug is in the new code, the error may result from an unexpected program state - a condition not considered - or a bug elsewhere. The developer won't know about this until after a function call returns, and by then it is too late - the damage is done. Debugger "undo" features can be helpful, but if the problem was caused by lower-level code, it is likely that too many instructions have executed since the error (or unexpected state) occurred. Often the only practical answer is to restart program execution from the beginning.

To find a problem in a lower-level routine, the developer must set a break point at the call to that routine, then step into the routine. Now the process begins again: the developer must observe the state of the program for each line of the lower-level routine. Of course, the same problem can occur in the lower-level routine; the problem could arise still lower in the code, requiring another restart, another break point, and another round of single-stepping.

In theory, all of this progresses smoothly: an error three levels down requires three separate program runs, three single-step function executions, and three sets of program evaluations during each execution. Reality is not so easy. The lower a function is in the call chain, the more likely it is to be used in many different places. A break point in a low-level routine is likely to be triggered dozens of times, if not far more. To avoid hitting that many break points, the programmer must temporarily disable every break point but the one at the highest level, then selectively re-enable lower-level break points as execution descends to the point where the trouble is believed to occur.

Managing so many break points and the time at which they are to be enabled is itself tedious and error-prone. If the developer enables break points too early, debugging takes much longer, and after continuing past dozens of unimportant calls to the same routine it is easy to skip over the one call that matters. The developer must also keep track of the state of the program at each level in order to reason about the execution and determine what is going wrong.

It has been my experience that it takes about three times as long to find a bug in a second-level function (i.e., one level below the code currently in development) as it does to find a bug in a first-level function (i.e., the code currently in development). This pattern holds true for every level, so a bug in a third-level function takes about nine times as long, a bug in a fourth-level function takes about 27 times as long, and so on. It doesn't take very many of these bugs to dominate overall program debug time.
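The estimate above amounts to a geometric progression. As a rough sketch, assuming the factor of three applies uniformly at every level (the function name `relative_debug_cost` and the bug counts are hypothetical, chosen only for illustration):

```python
def relative_debug_cost(level: int, factor: int = 3) -> int:
    """Cost of finding a bug at `level`, relative to a first-level bug.

    Level 1 is the code currently in development; each level below it
    multiplies the cost by `factor` (three, per the estimate above).
    """
    if level < 1:
        raise ValueError("level must be 1 or greater")
    return factor ** (level - 1)


# Hypothetical bug counts per level: many shallow bugs, a few deep ones.
bugs_per_level = {1: 20, 2: 4, 3: 2, 4: 1}

# Total debugging effort, and the share spent below the first level.
total = sum(count * relative_debug_cost(level)
            for level, count in bugs_per_level.items())
deep = sum(count * relative_debug_cost(level)
           for level, count in bugs_per_level.items() if level > 1)

for level, count in bugs_per_level.items():
    cost = relative_debug_cost(level)
    print(f"level {level}: {count} bugs x {cost} = {count * cost} units")
print(f"bugs below level 1: {deep} of {total} units")
```

With these made-up counts, the seven bugs below the first level account for 57 of the 77 units of debugging time, even though they are only about a quarter of the bugs.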

Complex analytical or optimization code such as that found in the Electronic Design Automation (EDA) field typically has many nested function levels. Every level of separation reduces controllability and testability, so not only are the bugs harder to trace, but they are also harder to trigger. This leads to bugs that are not found until after the software is deployed. And they might not cause program crashes either - they might simply degrade the quality of results, leading to competitive disadvantage.
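A contrived sketch of such a silent failure follows. All names and data here are hypothetical and are not taken from any EDA tool; the point is only that a low-level bug can produce a valid but worse result instead of a crash.

```python
def cheapest_index_buggy(costs):
    """Return the index of the smallest cost (contains an off-by-one bug)."""
    best = 0
    for i in range(1, len(costs) - 1):  # BUG: never examines the last entry
        if costs[i] < costs[best]:
            best = i
    return best


def cheapest_index(costs):
    """Corrected version: examines every entry."""
    best = 0
    for i in range(1, len(costs)):
        if costs[i] < costs[best]:
            best = i
    return best


def total_cost(choose, rows):
    """Higher-level code: pick one option per row and sum the chosen costs."""
    return sum(row[choose(row)] for row in rows)


# Hypothetical cost table: one row per decision, one column per option.
rows = [[5, 3, 4], [7, 6, 2], [9, 8, 1]]

print(total_cost(cheapest_index_buggy, rows))  # runs without error, worse total
print(total_cost(cheapest_index, rows))        # the achievable minimum
```

Neither run crashes; the buggy version simply reports 17 where 6 was achievable. A direct check on the low-level helper, such as asserting that `cheapest_index([7, 6, 2])` returns 2, exposes the error immediately.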

The key lesson, of course, is to find and fix as many of the bugs in the lower-level code as possible before using it in higher-level code. All too often, code is tested when it is all done, leading to a frustrating period during which nothing seems to work.

I've found the following to be true, time and time again:

No code is too trivial to test. You need reliable base code to develop higher-level code productively. Even if code can be deemed correct by inspection, it might be modified later or it might behave differently on another platform. Testing at all levels allows immediate isolation of bugs.
Test code should never be thrown away, even if it is very stripped-down compared to the real application code. If the test code is "in the way" of the application code such that it must be removed before proceeding with further development, the module's architecture is wrong. You need more controllability and observability so that direct testing of the module can continue.
Testing should be part of the build process, not something run only occasionally (such as hourly or daily). This forces you to write tests that can be executed quickly. With testing at all levels of code, I find that build times are roughly doubled: five minutes on a single CPU to completely rebuild and test 300,000 lines of code. That's five minutes for full regression testing, no matter where an enhancement or fix was made.
Standalone test programs for each module allow you to add test cases for low-level bugs directly, meaning that fixes are validated for all time, regardless of changes at higher levels. This is especially important for optimization programs where a series of operations is performed before the one that triggers a bug. Improvements in optimization may change the system state enough that the trigger would no longer apply.
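As a minimal sketch of the last two points, here is what a standalone test program for one low-level routine might look like. The routine `overlap_length`, the bug it once had, and the test names are all hypothetical; the standard `unittest` module is just one convenient way to structure it.

```python
import sys
import unittest


def overlap_length(a_lo, a_hi, b_lo, b_hi):
    """Hypothetical low-level routine: length of the overlap of two intervals."""
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo))


class OverlapLengthTest(unittest.TestCase):
    def test_partial_overlap(self):
        self.assertEqual(overlap_length(0, 5, 3, 8), 2)

    def test_contained_interval(self):
        self.assertEqual(overlap_length(0, 10, 2, 4), 2)

    def test_regression_disjoint_intervals(self):
        # Hypothetical past bug: disjoint intervals produced a negative
        # length. The case stays here permanently, independent of whatever
        # higher-level code first exposed it.
        self.assertEqual(overlap_length(0, 2, 5, 9), 0)


def run_tests() -> int:
    """Run this module's tests; return 0 on success, 1 on any failure."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OverlapLengthTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return 0 if result.wasSuccessful() else 1


if __name__ == "__main__":
    status = run_tests()
    if status != 0:
        sys.exit(status)  # non-zero exit status lets a build step fail
```

Because the program exits with a non-zero status on any failure, a build rule can run it right after the module is built and stop the build if a test fails.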

Writing complex analytical or optimization software is hard enough; debugging it should not be any harder. All too often testing is rushed to compensate for development schedule slips, leading to product quality problems. A robust testing methodology applied at all levels can keep testing time under control and maintain high software quality.

Chapman Consulting

Software Development Done Right.