Software Design for Testability

Making It Easier to Find and Fix Bugs

All software has bugs, especially when it is first written. Even if there is a formal proof that the implementation meets the specification, the specification can have bugs. The only way to be sure that the software does what you want is to test it with real data.

Traditionally this is done by running an application, then observing the results. Developers run the application themselves and fix every problem they find. At some point, it is released to select alpha testers for usability evaluation. The broader audience is likely to stress the software in different ways, exposing more bugs. Eventually, a production candidate is released to a wider set of beta testers and final bug fixes are made.

This is an intense time for developers. Bugs must be analyzed and solutions proposed. Often a user cannot continue evaluation until a reported bug is fixed, so a large fraction of all bugs found during evaluation are on the release's critical path.

Plots of bugs found and fixed over time are a staple of the release cycle. There is a spike when alpha code is released, another when beta code is released, and often (oops!) a spike when production code is released. Advancement to the next release stage typically occurs when the bug rate for the current stage drops to a steady value, meaning that more users are needed to find more bugs more quickly.

You thought I was going to say that the quality level had reached an acceptable level, didn't you? That's not the way it works. The standard methodology is just a way to maximize the rate at which bugs are found so they can be fixed. Software vendors don't want their developers waiting for bugs to be found. The resulting methodology acknowledges that a significant number of bugs will be in the final released code.

What if there was a better way? What if there was a way for developers to fix the vast majority of all bugs before the alpha release? What if the spike in the bug rate at each release point was just a blip?

Proactive Testing Methodologies

Numerous books have been written about the software testing crisis, and I'm hoping I don't have to write another. I write mostly computationally intensive software, the kind that seemingly has been most resistant to testing, and the methods I describe here are the ones I have used for 20+ years and 1,200,000+ lines of highly technical number-crunching code in the Electronic Design Automation (EDA) industry.

The first step to solving your software testing crisis is to agree that you should at least try to fix every bug before releasing your software. Acceptance of a "low" rate of software bugs is submission to mediocrity. Zero is the standard.

Extreme Programming has gained adherents; it uses a "test first" methodology which tries to ensure that no bugs exist in the code written each day (or other time period). To this end, it partitions work into "features," each of which can be tested and implemented in the specified time period. The tests for a given feature are written first, and because the code does not yet exist, they fail. When the tests pass, the code for that feature is done.

Extreme Programming didn't exist when I started my work, and even now I haven't been able to use it because:

In Research and Development (R&D), you sometimes have to experiment and explore alternatives. You don't always know exactly what will work; the scope of the "feature" may shrink if it turns out not to be feasible.
Features can be complex; you won't necessarily finish one in a day (or a week).
The support code for a user-visible feature might have several layers, none of which are directly visible to customers.
Feature-based testing does not necessarily exercise every line of the code.

One value I share with Extreme Programming advocates is that test code should never be thrown away. When you keep your test code and make it part of your build process, you avoid repeated rounds of manual testing in every release cycle. Computers do most of your testing each release, not humans. This is a huge efficiency gain.

Like the Extreme Programming methodology, I use standalone test programs. I run them as part of the system build process. When I run "make all_tests" in a subsystem directory, every test executable in the directory is rebuilt if needed and run. If any program finds an error, it returns a non-zero exit code and the build script halts there. This keeps me from missing an error.

Some people embed self-test code within their modules, either as a separate test Application Programming Interface (API) or as a conditionally compiled main routine. Both methods bloat the module; code that has nothing to do with the normal execution of the subsystem is shipped with it. Conditional compilation of test code also means that the code you test is not the same as the code you ship (the release build is made with test code turned off), imposing some risk that it will not behave the same way.

The SQLite project (http://www.sqlite.org/testing.html) also has a 100% statement coverage goal, but it pursues that goal with application-level tests. It is harder to control lower-level code this way, so the amount of test code is significantly larger (over 1000 times larger) than the product's code base. They see this as a plus; I disagree. Test cases have to be maintained just like the product code base, and that much test code becomes a burden. My test methodologies require about one line of test code for each line of product code. My tests also run faster, because they do not require full application startup each time.

Standalone Test Programs

Test code does not have the same purpose as application code, so it should be in a separate module. I find that full testing (see Yes, You Can Test Every Line of Code) requires about as much code as the module being tested, so to limit the size of test programs, every module has its own standalone test program. This means I can run a debug session that exercises only that module, but it does mean there is no single test entry point for the subsystem. A series of programs has to be run to test everything in the subsystem.

Compilation time is roughly proportional to the number of lines of code being compiled, so doubling the number of lines of code doubles the compile time. A single processor core requires 11 minutes to compile and test my research project, which is 500,000 lines of code. Most tests run very quickly because they set up a small example and make incremental changes to it. They do not run a full optimization. In fact, the only test programs that require any significant amount of time are the ones that test application-level code, which does perform full optimization.

The test programs generally print very little, because checking printed data would require comparison against a "golden" output; often they simply report "all tests succeeded." I try not to rely on "golden" outputs for scientific and engineering software because scientific number formats tend to be machine-dependent. (I finally wrote my own printf() equivalent in part to deal with this problem.)

When a test program is run, it generally does not have any command-line parameters. This reduces the complexity of build scripts. All test data is either embedded in the test program or stored in read-only files whose names are hard-coded within the test program.

Test programs may generate output files (especially if they exercise application-level code, not lower levels), but the output files are generally deleted before the test programs exit. Like "golden" outputs, output files can be machine-dependent unless you are very careful. Direct data structure checks should not be machine-dependent - if they are, you have just found a bug in your code!

A test program exercises its target module from the bottom up. Testing is all about dependability and trustworthiness, and if you haven't tested the lower level code yet, you can't depend on it yet.

Architecting Software for Testability

The first thing you want to remember about testable software is:

controllability + observability = testability

To test software completely, you must be able to control it completely and observe it completely. These concepts are well-defined in digital design - half of all Register Transfer Level (RTL) code in Verilog or VHDL is test code. If you're going to spend millions of dollars on masks to build your chip, you want to be quite certain that it does what you want!

Here are some principles to follow:

minimize the number of interactive modules (input or output)
avoid circular dependencies (modules at the same level which require each other to work)
independent testing = independent use; testability means code is easier to use elsewhere
minimize hidden code and variables, especially global variables

Maximizing controllability within test programs means that you must minimize the amount of interactive code (input or output) within your code. Use the Model-View-Controller (MVC) programming model and try to group user interaction into as few modules as possible. For scientific and engineering code, this comes naturally - its value comes from its algorithms, not its user interface. But even interactive code is managing data structures somewhere. Keep the data management out of the user interface!

Don't have multiple modules within the same code layer depend on each other. Unless all of them work, none of them work. This makes testing much harder; you will need to test all of the modules in one large test program, and at first you will be chasing bugs in all modules simultaneously. Recursion is a powerful technique, but keep the call cycle small, e.g. function A calls function B which in turn calls function A again.

Remember that independently testable modules are independently usable modules (see Testable Is Reusable). You don't have to make a module completely general from the start, but if it is well-defined and self-contained, it will be that much easier to extend it to meet other needs. And of course with a full test program, you can ensure that the extensions do not break existing module features.

Private functions and variables hurt controllability and observability. There's no disputing that information hiding conflicts with testability. You want to protect data from misuse, but you also want to ensure that it is used or modified when appropriate.

Here are ways to reconcile testability with information hiding in C++:

Provide read-only access to all object variables, even those not meaningful to applications, adding comments in the header stating they are public only for testing.
Make internal mutator member functions public, but add comments stating they are public only for testing.
Make internal mutator member functions const, reading private data and/or parameters, then returning a result instead of storing the result as private data.

I have used all of these methods to maximize testability in my code.

In C, you can use incomplete structure declarations (a typedef naming a structure whose fields are defined only in the implementation file) to declare a structure with fields that only authorized code can see. Add mutator functions which accept this type and assign values to it for controllability, and add accessor functions which accept this type and return a value from within it so that you have observability. Then ensure that mutator functions get everything they need from their parameters.

Global variables (including class static variables in C++) are poison. Think very hard before using them. The only valid reason for global variables, particularly private global variables, is that you are writing a service manager for your program that can have only one instance. A global memory manager (e.g. a malloc() replacement) is an example. Synchronization flags and queues for multithreading also fall into this category.

You will find it very hard to test a program that has global variables unless there are very few levels of code that can see or modify them. Hidden code or variables, particularly when they are multiple levels deep, lack controllability and observability because data may be massaged going in and filtered coming out. For example, error checking might replace bad values with sanitized, default values. This can mask bugs.

Even if your program is interactive or multithreaded at high levels, you can organize its code to minimize these aspects. There is usually quite a bit of code under the covers analyzing or modifying data structures. This code can be tested exhaustively, leaving a minimum of informally tested code (which should be isolated as much as possible to ensure it can be understood well).

Backup Testing Strategies

You can't always set up tests that are independent of any outside data. Sometimes you have to read or write files. Binary file parsers, like the OASIS file reader that I developed, are difficult to control because their data is hard to interpret manually. If you have trouble understanding a byte sequence (or it takes a long time to understand it), you probably won't want to enter it as three-digit octal escape sequences within a program's string constant.

This doesn't apply to most text parsers; it is a simple matter to define a "get next line" object or data structure and have your parser read data from it. One implementation can read from a file and another can read from a string array. Test the "get next line" object separately and then you can use it within the parser without fear that reading from an array will yield different results than reading from a file.

As you get higher and higher in the application's code hierarchy, test case setup and validation can become more difficult as more variables must be assigned and tested. Here "golden" inputs and outputs can be an appropriate testing strategy.

In lower code layers, however, you should reconsider your architecture if "golden" inputs and outputs seem to be the only way to test the code. You are surrendering controllability and observability, because application level data input and printing functions nearly always filter their data.

To maximize the bugs that you find before release, you want to minimize the amount of code that is tested using backup test strategies. Controllability and observability at all levels means that you need a nearly linear number of tests at each level to cover every line of code. If you can call only an API function several layers up, the number of tests required is likely to be exponentially larger - and you will have to set them all up.

Example: An OASIS Reader/Writer

OASIS (Open Artwork System Interchange Standard) is a SEMI (Semiconductor Equipment and Materials International) standard for representing integrated circuit polygon data. It replaces the earlier GDSII file format and includes many methods for reducing the size of integrated circuit layout design files. A typical OASIS file is 10-20 times smaller than an equivalent GDSII file.

Numbers in an OASIS file are variable length using a continuation bit; smaller numbers require fewer bytes. There are multiple ways of representing floating point numbers, some of which require only two bytes instead of the typical four or eight for IEEE 754 floating point numbers. Repeated coordinates (common in integrated circuit layouts) can be omitted through the use of modal variables, and repeated shapes can be represented as single objects with a repetition field. Names can be represented by index numbers instead of strings copied every time a named cell is referenced. Finally, blocks of OASIS data can be compressed inline to save even more space.

All of these features require code, of course; my GDSII reader/writer is about 8,000 lines of code including comments and test programs, while the OASIS reader/writer is about 124,000 lines including comments, test programs, conversion utilities, and advanced file validation covering all of the restrictions on OASIS data.

An error in an OASIS reader or writer could be catastrophic, leading to bad masks at a potential cost of millions of dollars. Thus it was essential to test all of the code completely. To make the code testable, it was organized in strict layer fashion as follows:

layer 0: read or write file bytes, possibly compressed
layer 1: read a primitive (integer, float, or string)
layer 2: read the fields of a record
layer 3: manage relationships between records
layer 4: manage file-level structures; validate records
layer 5: API

Layer 0 handles file buffering, especially when compressing data for writing. Layer 1 manages the representations of numbers and strings in a machine-independent manner. Layer 2 contains the individual record parsers, which read fields based on a flag field. Layer 3 handles modal variables for repeated coordinates and conversion of OASIS-specific geometry representations to generic polygons and paths usable by any application. Layer 4 handles name tables for indexing and implements record-by-record validation. Finally, layer 5 is the API that callers can use to read or write OASIS data without knowing anything about the OASIS specification except record properties, which are user-defined.

Lower level code never references higher-level data structures. For example, one OASIS record type specifies a compressed block that in turn contains additional OASIS records. The layer 0 code knows nothing of this; it simply reads and writes bytes. Compression setup is a little more complex, but the record parser code in layer 2 cannot corrupt the file compression code in layer 0, and vice versa.

Within a layer, some functions are written as utilities in support of others. Thus these functions were developed and tested first, and their test programs run first when building the reader/writer. Utilities within a layer cannot call the functions they support.

These strict rules ensure that the test program for module X never needs to rely on code in a higher layer, or on code that calls module X. Layer 0 was written and tested before any layer 1 code was written; layer 1 was written before any layer 2 code was written, and so on.

Advantages were immediately apparent. When testing the next layer of code, I could skip over calls to lower layers in the debugger, because they had already been tested (see The Cost of Debugging Software). There was almost no uncertainty about their behavior. (Occasional bugs in lower-level code were found; the offending file data or data structure values were added to the corresponding test program to ensure that the fixes did not break anything and that the bugs never recurred.)

In addition, because the file I/O was strictly separated from data structure management, record validation and optimization (for repetition creation) code could be tested without writing any temporary files. Because the code which reads or writes an OASIS file framework is in layer 5, few test programs could write OASIS files easily. The separation meant that much of the error checking and repetition analysis code could use data structures created directly in memory.

The full set of test programs meant that I could write the code on any platform I chose, then compile on any other platform and know that I would get the same results (or know immediately when I did not get the same results). The code works identically on 32-bit Windows, 32-bit Linux, and 64-bit Linux (except that polygon coefficients can be larger on a 64-bit platform).

Conclusions

It takes discipline to write software with a standard of zero bugs, but it is possible at a reasonable cost. The code structure necessary for proper testing makes it easier to reuse that code elsewhere. The confidence you get from working with fully tested code makes product releases much less stressful, and fully automated testing makes product releases faster. Finally, improved software quality makes customers happier.

And even if you do not test every line of code on the first day, a testable software architecture and standalone test programs allow you to add tests as you need them (e.g. whenever a bug is found) and help ensure that bugs don't come back.

Chapman Consulting

Software Development Done Right.