An empirical study about the effectiveness of debugging when random test cases are used

Automatically generated test cases are usually evaluated in terms of their fault-revealing or coverage capability. Besides these two aspects, test cases are also the major source of information for fault localization and fixing. The impact of automatically generated test cases on the debugging activity, compared to the use of manually written test cases, has never been studied before. In this paper we report the results of two controlled experiments with human subjects performing debugging tasks using either automatically generated or manually written test cases. We investigate whether the features that make the former less readable and understandable (e.g., unclear test scenarios, meaningless identifiers) have an impact on the accuracy and efficiency of debugging. The empirical study is aimed at investigating whether, despite their limited readability, automatically generated test cases can still be taken advantage of by subjects during debugging.
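
To illustrate the readability gap the study refers to, the following is a minimal, hypothetical sketch contrasting a Randoop-style, feedback-directed random test (meaningless identifiers, no recognizable scenario) with a manually written test for the same behavior. The BoundedStack class and both tests are illustrative assumptions for this sketch, not artifacts from the experiments or output of any specific tool.

import static org.junit.Assert.*;
import org.junit.Test;

// Hypothetical class under test, included only to keep the example self-contained.
class BoundedStack {
    private final int[] items = new int[10];
    private int size = 0;

    void push(int value) {
        if (size == items.length) throw new IllegalStateException("stack is full");
        items[size++] = value;
    }

    int pop() {
        if (size == 0) throw new IllegalStateException("stack is empty");
        return items[--size];
    }

    int size() { return size; }
}

public class ReadabilityContrastTest {

    // Style typical of feedback-directed random generation:
    // numbered test name, var0/var3 identifiers, no explicit intent.
    @Test
    public void test042() {
        BoundedStack var0 = new BoundedStack();
        var0.push(-1);
        var0.push(0);
        int var3 = var0.pop();
        assertEquals(0, var3);
        assertEquals(1, var0.size());
    }

    // Manually written equivalent: the scenario and expected behavior are explicit.
    @Test
    public void popReturnsMostRecentlyPushedValue() {
        BoundedStack stack = new BoundedStack();
        stack.push(-1);
        stack.push(0);
        assertEquals("pop should return the last pushed value", 0, stack.pop());
        assertEquals("one element should remain on the stack", 1, stack.size());
    }
}

Both tests exercise the same code and would reveal the same fault; the question the experiments address is whether the difference in readability affects how accurately and efficiently subjects can localize and fix that fault.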
