Theoretical and empirical studies on using program mutation to test the functional correctness of programs

In testing for program correctness, the standard approaches [11,13,21,22,23,24,34] have centered on finding data D, a finite subset of all possible inputs to program P, such that

(1) if for all x in D, P(x) = f(x), then P* = f,

where f is a partial recursive function that specifies the intended behavior of the program and P* is the function actually computed by program P. A major stumbling block in such formalizations has been that the conclusion of (1) is so strong that, except for trivial classes of programs, (1) is bound to be formally undecidable [23].

There is an undeniable tendency among practitioners to consider program testing an ad hoc human technique: one creates test data that intuitively seems to capture some aspect of the program, observes the program in execution on it, and then draws conclusions on the program's correctness based on the observations. To augment this undisciplined strategy, techniques have been proposed that yield quantitative information on the degree to which a program has been tested. (See Goodenough [14] for a recent survey.) Thus the tester is given an inductive basis for confidence that (1) holds for the particular application. Paralleling the undecidability of deductive testing methods, the inductive methods have all had trivial examples of failure [14,18,22,23].

These deductive and inductive approaches have had a common theme: all have aimed at the strong conclusion of (1). Program mutation [1,7,9,27], on the other hand, is a testing technique that aims at drawing a weaker, yet quite realistic, conclusion of the following nature:

(2) if for all x in D, P(x) = f(x), then P* = f or P is "pathological."

To paraphrase,

(3) if P is not pathological and P(x) = f(x) for all x in D, then P* = f.

Below we will make precise what is meant by "P is pathological"; for now it suffices to say that P not pathological means that P was written by a competent programmer who had a good understanding of the task to be performed.
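The weakness of conclusion (1) for finite D can be made concrete with a small sketch (Python is used here purely for illustration; the programs and test set are ours, not from the paper):

```python
# Two programs that agree on a finite test set D yet compute different
# functions: no finite D can, by itself, certify that P* = f.

def f(x):
    """Intended behavior: absolute value."""
    return abs(x)

def P(x):
    """A faulty program: correct only for inputs of small magnitude."""
    return abs(x) if abs(x) < 100 else -abs(x)

D = [-3, -1, 0, 2, 5]  # finite test data

# P passes every test in D ...
assert all(P(x) == f(x) for x in D)

# ... yet P* != f: the two programs disagree outside D.
assert P(1000) != f(1000)
```

Any finite D admits such a P, which is why conclusion (1) is unattainable in general and a weaker conclusion like (2) is the realistic target.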
Therefore if P does not realize f, it is "close" to doing so. This underlying hypothesis of program mutation has become known as the competent programmer hypothesis: either P* = f or some program Q "close" to P has the property Q* = f.

To be more specific, program mutation is a testing method that proposes the following version of correctness testing: given that P was written by a competent programmer, find test data D for which P(D) = f(D) implies P* = f. Our method of developing D, assuming either P or some program close to P is correct, is by eliminating the alternatives. Let Φ be the set of programs close to P. We restate the method as follows. Find test data D such that:

i) for all x in D, P(x) = f(x), and

ii) for all Q in Φ, either Q* = P* or for some x in D, Q(x) ≠ P(x).

If test data D can be developed having properties (i) and (ii), then we say that D differentiates P from Φ; alternatively, P passes the Φ mutant test.

The goal of this paper is to study, from both theoretical and experimental viewpoints, two basic questions:

Question 1: If P is written by a competent programmer and if P passes the Φ mutant test with test data D, does P* = f?

Note that, after formally defining Φ for P in a fixed programming language L, an affirmative answer to Question 1 reduces to showing that the competent programmer hypothesis holds for this L and Φ.

We have observed that under many natural definitions of Φ there is often a strong coupling between members of Φ and a small subset µ. That is, often one can reduce the problem of finding test data that differentiates P from Φ to that of finding test data that differentiates P from µ. We will call this subset µ the mutants of P, and the second question we will study involves the so-called coupling effect [9]:

Question 2 (Coupling Effect): If P passes the µ mutant test with data D, does P pass the Φ mutant test with data D?
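Conditions (i) and (ii) can be sketched operationally. The following is a minimal illustration, assuming Φ is given as an explicit list of alternative programs; since deciding Q* = P* is undecidable in general, mutants known to be equivalent are supplied by hand (all names and the example programs are ours):

```python
def differentiates(P, f, Phi, D, equivalent=()):
    """Check conditions (i) and (ii) for test data D.

    `equivalent` lists the mutants known to satisfy Q* = P*; this
    judgment cannot be automated in general, so it is an input here.
    """
    # (i) for all x in D, P(x) = f(x)
    if any(P(*x) != f(*x) for x in D):
        return False
    # (ii) every other mutant Q is "killed": some x in D with Q(x) != P(x)
    return all(Q in equivalent or any(Q(*x) != P(*x) for x in D)
               for Q in Phi)

# Example: P computes the maximum of two numbers; mutants alter its test.
P  = lambda a, b: a if a >= b else b
f  = P                                # the specification: binary maximum
Q1 = lambda a, b: a if a > b else b   # mutant: >= replaced by >  (equivalent)
Q2 = lambda a, b: a if a <= b else b  # mutant: >= replaced by <= (wrong)

D = [(2, 1), (1, 2), (3, 3)]
assert differentiates(P, f, [Q1, Q2], D, equivalent=(Q1,))  # P passes the test
```

Here (2, 1) kills Q2, while Q1 computes the same function as P and is set aside, so D differentiates P from this small Φ.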
Intuitively, one can think of µ as representing the programs that are "very close" to P.

In the next section we will present two types of theoretical results concerning the two questions above: general results expressed in terms of properties of the language class L, and specific results for a class of decision table programs and for a subset of LISP. Portions of the work on decision tables and LISP have appeared elsewhere [5,6], but the presentations given here are both simpler and more unified. In the final section we present a system for applying program mutation to FORTRAN and we introduce a new type of software experiment, called a "beat the system" experiment, for evaluating how well our system approximates an affirmative response to the program mutation questions.
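The coupling effect of Question 2 can be illustrated with a toy case (ours, not from the paper): test data chosen to kill two simple mutants in µ also kills a compound mutant that combines both changes.

```python
# Toy illustration of the coupling effect: data that kills the simple
# mutants m1 and m2 also kills a compound mutant Q combining both faults.

P  = lambda x, y: x * y + x   # original program
m1 = lambda x, y: x * y - x   # simple mutant: + replaced by -
m2 = lambda x, y: x + y + x   # simple mutant: * replaced by +
Q  = lambda x, y: x + y - x   # compound mutant with both changes

D = [(2, 3)]                  # P(2,3) = 8; m1 -> 4, m2 -> 7, Q -> 3

# D kills each simple mutant ...
assert all(any(m(x, y) != P(x, y) for (x, y) in D) for m in (m1, m2))
# ... and, in this instance, the compound mutant as well.
assert any(Q(x, y) != P(x, y) for (x, y) in D)
```

A single example of course establishes nothing in general; whether such coupling holds broadly is exactly what the theoretical and "beat the system" results of this paper examine.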

[1] Timothy Alan Budd, et al. Mutation analysis of program test data, 1980.

[2] Lawrence Yelowitz, et al. Observations of Fallibility in Applications of Modern Programming Methodologies, 1976, IEEE Transactions on Software Engineering.

[3] J. C. Huang, et al. An Approach to Program Testing, 1975, CSUR.

[4] Henry F. Ledgard. The case for structured programming, 1974.

[5] John B. Goodenough, et al. Correction to "Toward a theory of test data selection", 1975, IEEE Transactions on Software Engineering.

[6] Richard J. Lipton, et al. Hints on Test Data Selection: Help for the Practicing Programmer, 1978, Computer.

[7] Matthew M. Geller. Test data as an aid in proving program correctness, 1978, CACM.

[8] John D. Gould, et al. An Exploratory Study of Computer Program Debugging, 1974.

[9] Niklaus Wirth. PL360, a Programming Language for the 360 Computers, 1968, JACM.

[10] Karl N. Levitt, et al. SELECT—a formal system for testing and debugging programs by symbolic execution, 1975.

[11] Peter Henderson, et al. An experiment in structured programming, 1972.

[12] Peter Naur, et al. Programming by action clusters, 1969.

[13] William E. Howden. An evaluation of the effectiveness of symbolic testing, 1978, Softw. Pract. Exp.

[14] Solomon L. Pollack, et al. Decision Tables Theory and Practice, 1971.

[15] P. D. Summers, et al. Program construction from examples, 1975.

[16] S. L. Gerhart, et al. Toward a theory of test data selection, 1975, IEEE Transactions on Software Engineering.

[17] David E. Shaw, et al. Inferring LISP Programs From Examples, 1975, IJCAI.

[18] J. R. Brown, et al. Testing for software reliability, 1975.

[19] Leon J. Osterweil, et al. Some experience with DAVE: a Fortran program analyzer, 1976, AFIPS '76.

[20] Gordon H. Bradley, et al. Algorithm and bound for the greatest common divisor of n integers, 1970, CACM.

[21] G. Metze, et al. Fault diagnosis of digital systems, 1970.

[22] William E. Howden. Methodology for the Generation of Program Test Data, 1975, IEEE Transactions on Computers.

[23] Jeffrey D. Ullman, et al. Formal languages and their relation to automata, 1969, Addison-Wesley series in computer science and information processing.

[24] William E. Howden, et al. Reliability of the Path Analysis Testing Strategy, 1976, IEEE Transactions on Software Engineering.

[25] Steve Hardy, et al. Synthesis Of LISP Functions From Examples, 1975, IJCAI.

[26] Lee J. White, et al. A Domain Strategy for Computer Program Testing, 1980, IEEE Transactions on Software Engineering.