Levels of Software Testing

Software Without Bugs?

Have you ever released a program without any bugs? Not likely! It is well known that testing cannot prove a program is bug-free; it can only demonstrate that a program has bugs (by running it and showing the erroneous output). Even if the code is generated automatically, the specification might have bugs. Your goal, then, is to reduce the bug rate to an acceptable (perhaps undetectable) level.

So how do you determine that the bug rate has declined enough? That depends on your testing goals and your customers' willingness to accept early bugs. If your product helps customers solve a problem that they otherwise could not solve, they might accept occasional bugs. If there are competing products on the market already, bug tolerance might be very low.

Ideally, the standard for release will be zero known bugs, but your management may not be willing to invest that much time for testing. My experience with all-conditions white box testing (see Yes, You Can Test Every Line of Code) has been that about one line of test code is required for every line of product code. That means, of course, that testing a unit of code takes about as long as writing the code.

Even quick-and-dirty programs benefit from robust testing. I wrote a species display card printing program for the local chapter of the California Native Plant Society to read data from a comma-separated-value (CSV) file and print species cards in HTML according to a specified format. There were two separate subsystems - one to read a CSV file and one to extract species information from it - in a program totaling about 2,000 lines with comments. I wrote a test program for each of the subsystems, bringing the total to about 3,300 lines. Not only did these give me the confidence needed to print 1,500 species cards for laminating, but the code as structured for testability is already being reused for other projects. See Testable Is Reusable.

Minimal Robust Testing

Assuming your management is not user-hostile, you'll want to do robust testing. Even if your test budgets are tight, you can do some proactive testing fairly easily. Standalone test programs with basic test frameworks for every module (or at least every abstraction layer - see Writing Code Layer by Layer) give you an infrastructure for robust testing. Even if you don't try to test every line of code right away, the infrastructure will allow you to add test cases directly to the standalone program, without requiring a full application-level test.

At the very least, you will want test programs for each abstraction layer, in particular the Application Programming Interface (API) for each layer. API routines tend to be self-contained, requiring less setup than the (usually hidden) support routines for that abstraction layer. Thus you can run basic tests more quickly, while still obtaining fair coverage of the code.

If you don't have test programs at every abstraction layer, you'll need more tests. As the call chain to a low-level routine gets longer, the number of conditions required to exercise a particular line of code also increases. You may end up with a situation like SQLite, where the tests are over 1000 times as large as the product (http://www.sqlite.org/testing.html).

In the worst case, future code changes somewhere else in your application may change the conditions required to exercise a particular line of code, and your test may no longer reach that line. Optimization programs such as those found in Electronic Design Automation (EDA) are particularly prone to this problem. I have had test cases for bug fixes stop exercising the code which crashed. Absent a way to invoke the code directly, we had no way to ensure that the bugs were still fixed.

Basic Black Box Testing

To start, you need to ensure your product meets some minimal standards of functionality. If none of the features that are your selling points even work, your product won't ever be used. The simplest tests are called "smoke tests," and they exercise each of the features. This is typical for application-level testing - what does the user see, and does it appear to work?

Black box tests are tests written without looking at (sometimes without even knowing) the code being executed. These tests ensure that the code accepts valid inputs and rejects invalid inputs. Ideally, results from the good inputs are validated and the error messages from the bad inputs are examined.

Inputs include both data and configuration parameters (application settings). Naturally the number of possible combinations of all input values and configuration parameters is very high, so some assumptions about dependence between values must be made. Usually input values and configuration parameters form groups, and the values within a particular group are changed in a sequence of tests while all other values remain the same.

Generally the range for each input or parameter will be part of the product's specification. Typical practice is to choose values at each limit of the range (low and high), values just inside the limits, values just outside the limits, and a sprinkling of values in the middle of the valid range. Values that are barely valid or barely invalid ensure that bounds checks are configured properly; values in the middle of the range represent "typical" inputs.
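The value selection described above can be sketched in a few lines. This is a minimal illustration in Python, not taken from any particular product; the helper names boundary_values and is_valid, the step parameter, and the 1-to-100 range are all assumptions made for the example.

```python
def boundary_values(low, high, step=1):
    """Return test values around an inclusive valid range [low, high].

    Hypothetical helper: 'step' is the smallest meaningful increment.
    """
    middle = low + (high - low) // 2
    candidates = [
        low - step,   # just outside the lower limit (invalid)
        low,          # at the lower limit
        low + step,   # just inside the lower limit
        middle,       # a "typical" value from the middle of the range
        high - step,  # just inside the upper limit
        high,         # at the upper limit
        high + step,  # just outside the upper limit (invalid)
    ]
    # Drop duplicates (possible for very narrow ranges) while keeping order.
    return list(dict.fromkeys(candidates))


def is_valid(value, low, high):
    """The bounds check that the values above are meant to exercise."""
    return low <= value <= high


# Example: a hypothetical parameter documented as valid from 1 to 100.
for v in boundary_values(1, 100):
    print(v, "valid" if is_valid(v, 1, 100) else "invalid")
```

A real test would feed each value to the product and compare the product's accept/reject decision with the specification, rather than with a helper like is_valid.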

Usually only one input or parameter will be invalid at a time, while all others are valid. The implicit assumption is that all error checks are independent: an error check will not rely on the values of other inputs or parameters. If this assumption is not true, then still more tests must be added.
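A rough sketch of the one-invalid-at-a-time strategy follows, again in Python. The parameter names (width, height, copies) and their invalid values are hypothetical; the point is only that each generated case differs from a known-good baseline in exactly one input.

```python
def one_invalid_at_a_time(valid_baseline, invalid_values):
    """Yield (name, bad_value, inputs) with exactly one invalid input each."""
    for name, bad_values in invalid_values.items():
        for bad in bad_values:
            inputs = dict(valid_baseline)  # copy, so the baseline stays valid
            inputs[name] = bad
            yield name, bad, inputs


# Hypothetical parameters: every value in 'baseline' is assumed valid,
# every value in 'invalid' is assumed to lie just outside the legal range.
baseline = {"width": 10, "height": 20, "copies": 1}
invalid = {"width": [0, 1001], "height": [0, 1001], "copies": [0, 51]}

cases = list(one_invalid_at_a_time(baseline, invalid))
for name, bad, inputs in cases:
    print("expect an error about", name, "for inputs", inputs)
```

With two invalid values for each of three parameters this produces six cases. Testing every combination of the invalid values with each other would need more, and the count grows quickly as parameters are added, which is why the independence assumption matters.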

Even with these assumptions, huge numbers of tests may be required at the application level; the SQLite project runs millions of tests prior to a full release. Full statement and condition coverage is also very hard to demonstrate with black box testing.

Basic White Box Testing

In practice, you cannot be sure you have exercised every line and every condition in a program without looking at the source code. It is necessary to "open the box" at some point, so white box testing uses an analysis of the source code to define additional tests. Black box testing is still necessary at each level of the code - it is the best way to determine when parts of the specification were not implemented - but for complex code it can never exercise the implementation fully.

Like basic black box testing, the goal of basic white box testing is to ensure that there are no catastrophic failures in any application layer. Every function needs to be executed at least once, with examples of known good inputs and known bad inputs for each function.

Running "smoke tests" at every layer of the application helps minimize the cost of debugging upper layers (see The Cost of Debugging Software). Access to the actual source code also simplifies validation - you don't need to write output file parsers. Instead, variables can be tested directly against the expected values.

Because many of the code layers being invoked are normally hidden, there may not be simple APIs. As a result, you usually need standalone test programs written in the same language as the product. These should be executed whenever a subsystem is built (e.g. in the makefile). They are not shipped with the product or subsystem, but they have many benefits, as described below.
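As an illustration only, here is what a very small standalone test program might look like. It is written in Python for brevity (so both the "product" routine and the test are Python here); clamp is a hypothetical internal support routine, and run_tests and check are names invented for this sketch.

```python
def clamp(value, low, high):
    """Hypothetical internal support routine (not part of any public API)."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


def run_tests():
    """Run every check and return the number of failures."""
    failures = 0

    def check(description, condition):
        nonlocal failures
        if not condition:
            failures += 1
            print("FAIL:", description)

    # Known good inputs: results are compared directly with expected values.
    check("value inside range is unchanged", clamp(5, 0, 10) == 5)
    check("value below range is raised", clamp(-3, 0, 10) == 0)
    check("value above range is lowered", clamp(42, 0, 10) == 10)

    # Known bad input: the routine must reject an inverted range.
    try:
        clamp(5, 10, 0)
        check("inverted range is rejected", False)
    except ValueError:
        check("inverted range is rejected", True)

    print("failures:", failures)
    return failures
```

A build step could call run_tests() and stop the build when the result is non-zero, which is one way to get the "run whenever the subsystem is built" behavior described above.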

All-statements White Box Testing

All-statements white box testing goes beyond basic white box testing by seeking to execute every line of code at least once. Basically, every loop must be run at least once, and each branch of every "if" statement must be run at least once. You ensure that the code runs as expected, though the tests might not exercise every condition required to get to a particular line of code. Thus some subtle bugs might still lurk.
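The following sketch shows why all-statements coverage can leave gaps. The function shipping_cost and its pricing rules are hypothetical and exist only for this example.

```python
def shipping_cost(weight_kg, express):
    """Hypothetical function under test."""
    cost = 5.0
    if weight_kg > 10:   # first "if": heavy branch
        cost += 2.0
    else:                # first "if": light branch
        cost += 0.5
    if express:          # second "if": taken only for express orders
        cost *= 2
    return cost


# Two tests are enough to run every statement and both outcomes of each "if":
assert shipping_cost(20, True) == 14.0   # heavy branch, express taken
assert shipping_cost(1, False) == 5.5    # light branch, express skipped

# The combinations (heavy, not express) and (light, express) were never run,
# so a bug that appears only in those combinations would go unnoticed.
```

Covering the remaining two combinations is the kind of additional work that the next level of testing asks for.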

Precise control of variables at this level of detail requires standalone test driver programs or significant numbers of application-level tests. Driver programs are strongly recommended because it may be hard to set up an application-level test which has all condition variables assigned in the manner needed.

All-conditions White Box Testing

The highest standard for white box testing ensures that every possible condition is activated. Every line of code is executed under all condition variations to detect interrelationships between variables. For example, the C language expression (a || b || c) will be true if any of the variables is true (non-zero). All-conditions white box testing of this expression requires at least four tests: three in which a different single variable is set to a true value, and one in which all three variables are false.
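A small sketch of condition tests for an expression like (a || b || c) is shown here, using Python's "or" as a stand-in for the C operator (both evaluate left to right and stop at the first true operand). The names any_set, CASES, and run_condition_tests are invented for the example. The table has one case in which each variable alone decides the result, plus an all-false case so that the false outcome is observed too.

```python
def any_set(a, b, c):
    """Python stand-in for the C expression (a || b || c)."""
    return bool(a or b or c)


CASES = [
    # (a, b, c, expected)
    (1, 0, 0, True),   # a alone decides the result
    (0, 1, 0, True),   # a is false, so b decides
    (0, 0, 1, True),   # a and b are false, so c decides
    (0, 0, 0, False),  # every operand is false
]


def run_condition_tests():
    """Return the list of cases whose result differs from the expected value."""
    return [case for case in CASES if any_set(*case[:3]) != case[3]]


print("failing cases:", run_condition_tests())
```

If one of the operands were accidentally dropped from the expression, exactly one of the first three cases would fail, which is the kind of change this level of testing is meant to catch.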

At this level, you ensure that the code runs as implemented. Of course, this is not quite the same as saying that it runs as specified - the implementation might not match the specification - but you will know that the code does what the developer intended. Tests at this level will catch code changes that affect the results returned to callers. If you can add a feature or fix a bug without affecting any of the existing tests, you know that existing callers will continue to be able to use the code.

Precise control of variables at this level of detail requires standalone test driver programs at every abstraction layer or copious numbers of application-level tests. Test driver programs are strongly recommended because it is nearly impossible to set up an application-level test which has all condition variables assigned in the manner needed. This is especially true for optimization algorithms or applications with several abstraction layers.

Don't Ever Throw Test Code Away! (Advantages of Standalone Test Drivers)

As I describe more fully in Never Throw Test Software Away, there are many advantages to standalone test drivers:

You get finer control of the code being tested.
Tests run much faster because you don't need to set up the full application each time.
Debugging and test development are much easier because you can set up the failing example directly and analyze it.
If test drivers are developed with (or just after) the code under test, they provide an example of how the code will be used in practice.

Unless you are brave (or foolhardy) enough to skip testing of your application until all of it is done, you will be writing some test code to exercise its abstraction layers. Standalone test programs let you keep the value provided by these test stubs. Because they run without human intervention, they are much better for qualification runs - they can be run for every source code commit and every nightly build.

Tests that exercise only a specific piece of code will also run faster than tests invoked at the application level. You don't really need to run the command line parser for every test of a file I/O routine, for example.

Even if you don't write all-conditions white box test programs immediately, keeping the test code in the form of a standalone test driver allows you to add new test cases at any time. If you get a bug, add its setup to the test program, run the test program, and verify that the bug can be reproduced (if it can't, you haven't performed all of the setup necessary). Now you can fix the problem without the overhead of the full application, and the test will remain in the driver forever to ensure the bug doesn't return.
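The bug workflow above might look like this in miniature. Everything here is hypothetical: the bug report ("averaging an empty list crashes"), the two versions of average, and the assumption that the desired result for an empty list is 0.0. Both versions are shown side by side only so the sketch can demonstrate the test failing before the fix and passing after it.

```python
def average_buggy(values):
    """Version from the hypothetical bug report."""
    return sum(values) / len(values)   # raises ZeroDivisionError for []


def average_fixed(values):
    """Version after the fix (0.0 for an empty list is an assumed requirement)."""
    if not values:
        return 0.0
    return sum(values) / len(values)


def regression_empty_list(average):
    """Return True if 'average' handles the input from the bug report."""
    try:
        return average([]) == 0.0
    except ZeroDivisionError:
        return False


# Step 1: the new test reproduces the bug against the old code.
print("before fix:", regression_empty_list(average_buggy))
# Step 2: the same test passes once the fix is in, and stays in the driver.
print("after fix: ", regression_empty_list(average_fixed))
```

In a real driver only the current version of the routine exists; the test is simply run once before the fix and again afterward.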

Finally, if you follow the layer-by-layer development strategy I described in Writing Code Layer by Layer, sketching the application's design from the top down and implementing code from the bottom up, immediately writing a test driver lets you ensure that the newly written application code can in fact be used. If individual test setup is complicated, the code will be hard to use - and you will have more bugs in the next layer.

The Limits of Standalone Test Programs

At some point, you will write the top-level application driver. This is the point at which a standalone test program is no longer feasible, and less-capable test methods are needed. Naturally you want to minimize the amount of code in the top-level application driver. It should only read top-level parameters, select a course of action, and then invoke tested code to perform that action.

Here "golden" outputs are necessary, since you can no longer have code-level access to data structures and cannot test them directly. A script-based test driver is very helpful here. This is often written in a scripting language; for each test it launches the application with a specific set of parameters (command line flags and input files), then examines the program's return code, printed results (e.g. from stdout and stderr in C/C++), and output files. Often some manipulation of outputs is necessary for full repeatability, so the script driver may apply modifications (e.g. removing dates from log files).
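A script-based driver of this kind can be sketched with Python's standard subprocess and re modules. This is only an outline under stated assumptions: the "application" is a stand-in one-line Python program, the date format is assumed to be YYYY-MM-DD, and the names normalize, run_case, APP, and GOLDEN are invented for the example. It checks the return code and standard output; comparing stderr and output files would follow the same pattern.

```python
import re
import subprocess
import sys

# Assumed date format in the application's output: YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def normalize(text):
    """Replace dates so that output is repeatable from run to run."""
    return DATE_PATTERN.sub("<DATE>", text)


def run_case(command, golden_stdout, expected_returncode=0):
    """Launch the application and compare its results with the golden output.

    Returns a list of problem descriptions; an empty list means the case passed.
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    problems = []
    if completed.returncode != expected_returncode:
        problems.append("unexpected return code %d" % completed.returncode)
    if normalize(completed.stdout) != golden_stdout:
        problems.append("stdout differs from golden output")
    return problems


# Stand-in "application": a one-line program that prints today's date.
APP = [sys.executable, "-c",
       "import datetime; "
       "print('report generated', datetime.date.today().isoformat())"]
GOLDEN = "report generated <DATE>\n"

print("problems:", run_case(APP, GOLDEN))
```

In practice the golden text would be stored in a file next to the test inputs, and the command would name the real application with its command line flags and input files.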

Generally I write my own standalone test programs because customization is simpler. Often a series of small tests will be run on a single object or data structure; this reduces the amount of setup code and allows individual tests to be more complex (increasing code coverage more quickly). If you wish, you can look at open source unit test frameworks like Check (http://sourceforge.net/projects/check) or CppUnit (http://sourceforge.net/apps/mediawiki/cppunit/index.php?title=Main_Page). I don't have any experience with them and cannot recommend for or against them. Ditto for commercial products like Rational Test RealTime (http://www-01.ibm.com/software/awdtools/test/realtime) or Parasoft C/C++test (http://www.parasoft.com/jsp/products/cpptest.jsp?itemId=47).

Most of my work has been in data-intensive software like circuit optimization or analysis, text processing, parsing, and the like. I rarely write GUI-based programs, but many of the same principles apply: push as much of the processing as possible into separate, testable modules, so that the user interface does nothing except dispatch inputs to tested code, then display the results on the screen. Wikipedia has a list of GUI testing tools at http://en.wikipedia.org/wiki/List_of_GUI_testing_tools. As before, I have not used these tools and cannot recommend for or against any of them.

Conclusions

Software quality requires software testing. For the highest levels of software quality, you need full testing. Although there are strategies to maximize testing effectiveness when the initial testing budget is insufficient (see When to Stop Adding Tests During Development), in the end you will simply outsource bug-finding to your customers - who may not be pleased at the costs this imposes on them. You're going to test all of the code one way or another (bug fixing is a kind of testing); you might as well do the job right from the beginning.

Chapman Consulting

Software Development Done Right.