Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system

Among the many different kinds of program repair techniques, one widely studied family is test suite based repair. However, test suites are in essence input-output specifications and are thus typically inadequate for completely specifying the expected behavior of the program under repair. Consequently, the patches generated by test suite based repair techniques may simply overfit to the test suite used and fail to generalize to other tests. We analyze the overfitting problem in program repair in depth and give a classification of this problem. This classification will help the community to better understand the overfitting problem and to design techniques that defeat it. We further propose and evaluate an approach called UnsatGuided, which aims to alleviate the overfitting problem for synthesis-based repair techniques by means of automatic test case generation. The approach uses additional automatically generated tests to strengthen the repair constraint used by synthesis-based repair techniques. We analyze the effectiveness of UnsatGuided: 1) analytically, with respect to alleviating two different kinds of overfitting issues; 2) empirically, based on an experiment over the 224 bugs of the Defects4J repository. The main result is that automatic test generation is effective in alleviating one kind of overfitting issue, regression introduction, but, due to the oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue, incomplete fixing.
