Debugging program failures exhibited by voluminous data

In debugging, a programmer often observes the values of program variables (the state) at various points of program execution. A large input data set may cause a program failure only after several iterations, generating a large number of intermediate states. Debugging a program when the failure is exhibited by a large data set is hard because the clues that may help in locating the fault are obscured by the large amount of information the programmer has to process. From a database of debugging experiences maintained at the Open University in the United Kingdom, we found cases where programmers had to abandon the debugging of such failures, while in other cases programmers spent weeks or months debugging. Clearly, a smaller data set that exhibits the same failure should lead to the diagnosis of faults more quickly than its larger counterpart. In this dissertation, we investigate five techniques for deriving a smaller data set that reproduces the same failure as the original input data. We term such a smaller data set a data slice. The process of creating a data slice is called data slicing. The five techniques are: invariance analysis, origin tracking, random elimination, input-output analysis, and program-specific heuristics. The choice of a technique for data slicing may be based on the properties of a program, classified by the relationship between the input and output elements. Once a data slice is obtained, other general-purpose debugging techniques can be employed to locate the fault. Data slicing enables the programmer to debug failures exhibited by large data sets more efficiently and, more importantly, to proceed with debugging tasks in cases where doing so would otherwise seem impossible.
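As a rough illustration of the idea of a data slice, the sketch below shows one plausible reading of random elimination: repeatedly discard a random subset of the input elements and keep the reduced input whenever the failure still occurs. The abstract does not specify the algorithm, so the function name `random_elimination`, its parameters, and the toy failure predicate `fails_on_negative` are assumptions made for this example only, not the dissertation's method.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def random_elimination(
    data: Sequence[T],
    fails: Callable[[List[T]], bool],
    drop_fraction: float = 0.5,
    max_attempts: int = 200,
    seed: int = 0,
) -> List[T]:
    """Return a smaller input that still makes `fails` return True.

    Hypothetical sketch: `fails` stands in for running the program under
    test and checking whether the same failure is exhibited.
    """
    rng = random.Random(seed)
    current = list(data)
    if not fails(current):
        raise ValueError("the original data does not exhibit the failure")

    for _ in range(max_attempts):
        if len(current) <= 1:
            break
        # Pick a random subset of positions to eliminate.
        k = max(1, int(len(current) * drop_fraction))
        drop = set(rng.sample(range(len(current)), k))
        candidate = [x for i, x in enumerate(current) if i not in drop]
        # Keep the reduced input only if the failure is still reproduced.
        if fails(candidate):
            current = candidate
    return current


# Toy stand-in for a failing program: it "fails" on any negative element.
def fails_on_negative(items: List[int]) -> bool:
    return any(x < 0 for x in items)


data = list(range(1000)) + [-1]
data_slice = random_elimination(data, fails_on_negative)
print(len(data), len(data_slice))
```

The result is not guaranteed to be minimal; it is only guaranteed to be no larger than the original and to still exhibit the failure, which is the defining property of a data slice as described above.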