Exploring True Test Overfitting in Dynamic Automated Program Repair using Formal Methods

Automated program repair (APR) techniques have shown a promising ability to generate patches that fix program bugs automatically. Typically such APR tools are dynamic, in the sense that they find bugs by testing and validate patches by running a program's test suite. Patches can also be validated manually. However, neither method can truly tell whether a patch is correct. Test suites are usually incomplete, so APR-generated patches may pass the tests without being truly correct; in other words, the APR tools may be overfitting to the tests. The possibility of test overfitting leads to manual validation, which is costly, potentially biased, and can also be incomplete. Therefore, we must move past these methods to truly assess APR's overfitting problem.

We aim to evaluate the test overfitting problem in dynamic APR tools using ground truth given by a set of programs equipped with formal behavioral specifications. Using these formal specifications and an automated verification tool, we found that there is definitely overfitting in the generated patches of seven well-studied APR tools, although many (about 59%) of the generated patches were indeed correct. Our study further points out two new problems that can affect APR tools: changes to the complexity of programs and numeric problems. An additional contribution is the first publicly available data set of formally specified and verified Java programs, their test suites, and buggy variants, each of which has exactly one bug.
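To make the notion of test overfitting concrete, the following is a minimal illustrative sketch and is not taken from the study's data set. The `max` method, its three test cases, and the "overfitted" patch are all hypothetical. The specification is written as a JML-style postcondition in a comment, and a bounded exhaustive check over small inputs stands in for the automated verification tool; that substitution is an assumption made only to keep the example self-contained and runnable.

```java
import java.util.function.IntBinaryOperator;

class OverfittingSketch {
    // Hypothetical test suite for max(a, b): each row is {a, b, expected}.
    static final int[][] TESTS = { {3, 1, 3}, {1, 2, 2}, {5, 5, 5} };

    // Buggy variant: ignores b entirely.
    static int buggy(int a, int b) { return a; }

    // Hypothetical patch that passes every test above but is still wrong.
    static int overfitted(int a, int b) { return a < 2 ? b : a; }

    // Correct implementation.
    static int correct(int a, int b) { return a >= b ? a : b; }

    // Dynamic validation: does the candidate pass the whole test suite?
    static boolean passesTests(IntBinaryOperator f) {
        for (int[] t : TESTS) {
            if (f.applyAsInt(t[0], t[1]) != t[2]) return false;
        }
        return true;
    }

    // Specification, in JML style:
    //   ensures \result >= a && \result >= b && (\result == a || \result == b);
    // Checked here by bounded enumeration, NOT by a real verifier.
    static boolean meetsSpec(IntBinaryOperator f, int bound) {
        for (int a = -bound; a <= bound; a++) {
            for (int b = -bound; b <= bound; b++) {
                int r = f.applyAsInt(a, b);
                boolean ok = r >= a && r >= b && (r == a || r == b);
                if (!ok) return false;
            }
        }
        return true;
    }

    static void report(String name, IntBinaryOperator f) {
        System.out.println(name + ": tests=" + passesTests(f)
                + " spec=" + meetsSpec(f, 10));
    }

    public static void main(String[] args) {
        report("buggy", OverfittingSketch::buggy);           // tests=false spec=false
        report("overfitted", OverfittingSketch::overfitted); // tests=true spec=false
        report("correct", OverfittingSketch::correct);       // tests=true spec=true
    }
}
```

The overfitted patch is accepted by the test suite yet violates the postcondition for inputs such as `a = 2, b = 7`, which is exactly the gap that a formal specification can expose and an incomplete test suite cannot.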
