Release testing for probable correctness

A plausible, probabilistic theory for release testing of software is presented. In a release test a given program is assessed and testing ends. The theory gives the number of test points that must be used to guarantee correctness with a given probability. Its strengths are: nothing need be assumed about the distribution or likelihood of failures or faults a priori; and, it is a true correctness theory, not one predicting future behavior by sampling a supposed "operational distribution." Application of the theory to partition testing and structural testing suggests ways to decide when and how those methods should be used.

1. The Many Kinds of Testing

Testing of computer software occurs in a wide variety of situations, each with different goals. Initial unit tests are often part of debugging, for local diagnosis and fault correction. When software is first integrated, testing may provide input to a model of fault repair that seeks to predict the effort needed to reach an acceptable level of quality. And finally, when software is released, test performance is pure assessment: at best no failures are found, but in any case the purpose of the test is to gain confidence in the decision to stop testing.

Theories appropriate to these distinct goals are perhaps also distinct. Debugging theory may require a model of the programming process that includes the important psychological contribution of the human programmer [1]. The fault-correction process is so complex that an empirical model with many parameters for fitting complex behavior may be appropriate [2].

Theories of testing have often lacked a clear focus [3]. An "absolute" theory views the program and its specification as a puzzle, and tests as probes intended to solve it. The object of testing is to find such clever tests that the puzzle can be unravelled, and the program shown by the tests to be correct. Although the absolute view has been successful in special cases of restricted languages [4] or of special knowledge of the form the correct program must take [5, 6], in general it must fail because the problem of fixing the infinite behavior of a program with a finite number of test cases cannot be solved [7].

A "debugging" theory seeks to detect errors that commonly occur in programs. Tests are chosen so that often-made blunders will be caught. Examples of such schemes are structural testing (for a recent example see [8]), in which the program's parts are exercised; and domain testing based on the specification, which seeks to catch out the programmer who has omitted cases or confused boundary conditions [9]. These debug methods are very useful--indeed, faced with the actual unit testing of real software, no other systematic method is available--but they cannot claim theoretical validity. To see this, imagine the programmer as an antagonist: if the debug methods to be used on a program were known, the program could be adjusted to pass its tests, yet still be incorrect.

The release test represents a clean situation. In the simplest case the test exposes no errors. The goal of a release-test theory is to assess the confidence inspired by this success, as the debug theories cannot. One may hope to realize this goal without considering the process by which the software was developed.

Any useful theory of release testing must be probabilistic in nature. Testing often stops only because resources, patience, and ingenuity are exhausted.
The quality of the resulting test can be expected to vary, so success has varying implications for the confidence to be placed in the program. An absolute theory can only call most such tests failures, and cannot distinguish the better from the worse.

2. Tests as Samples of Program Behavior

"Probabilistic" testing theories have not been popular (but see [10, 11]). The idea that tests are samples allowing statistical statements to be made about a program seems unexceptionable, but there is controversy about the significance of these statements. Existing models and analogies are flawed.

2.1 Input Data Space

The most straightforward statistical view of testing sees the test inputs as samples drawn from the entire input space. By observing the program's behavior on these samples, predictions can be made about its behavior on an arbitrary sample not yet drawn. Samples must be independently selected, and the likelihood that a given input is selected must conform to actual usage. There is an "operational distribution" for any program: the probability F(x) that input x will actually be used. For a sample drawn according to this distribution, standard methods can be used to predict the probability of failure, and the confidence to be attributed to this prediction, based on the failure behavior of the sample. For 1 - e confidence that the probability of no faults is at least
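
The standard calculation invoked here can be made concrete. If each test point is drawn independently from the operational distribution and the program's true per-input failure probability is θ, then N test points all succeed with probability (1 - θ)^N, so N failure-free tests support the claim "failure probability is at most θ" with confidence C = 1 - (1 - θ)^N. The sketch below solves this relation for N; it is a minimal illustration under these independence assumptions, and the function name tests_required and the sample values of θ and C are introduced here for illustration, not taken from the theory as stated above.

    import math

    def tests_required(theta: float, confidence: float) -> int:
        # If the true per-input failure probability is theta, then N
        # independent test points drawn from the operational distribution
        # all succeed with probability (1 - theta)**N.  Requiring that
        # chance to be at most 1 - confidence gives
        #   N >= log(1 - confidence) / log(1 - theta).
        if not (0.0 < theta < 1.0 and 0.0 < confidence < 1.0):
            raise ValueError("theta and confidence must lie in (0, 1)")
        return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - theta))

    # Illustrative values: a failure probability of at most 1 in 1000,
    # asserted with 99% confidence, needs about 4603 failure-free tests.
    print(tests_required(theta=1e-3, confidence=0.99))

Since ln(1 - θ) ≈ -θ for small θ, N grows roughly as -ln(1 - C)/θ: high confidence in a very small failure probability demands a very large failure-free sample.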