Is the Ground Truth Really Accurate? Dataset Purification for Automated Program Repair

Datasets of real-world bugs shipped with human-written patches are used intensively to evaluate automated program repair (APR) techniques. In these evaluations, the human-written patches serve as the ground truth against which manual or automated assessment approaches judge the correctness of test-suite adequate patches. An inaccurate human-written patch that is tangled with other code changes threatens the reliability of the assessment results. Constructing such datasets therefore requires considerable manual effort to isolate real bug fixes from bug-fixing commits. However, this manual work is time-consuming and error-prone, and little is known about whether the ground truth in such datasets is really accurate. In this paper, we propose DEPTEST, an automated DatasEt Purification technique from the perspective of triggering Tests. Leveraging coverage analysis and delta debugging, DEPTEST automatically identifies and filters out the code changes that are irrelevant to the bug exposed by the triggering tests. To evaluate the effectiveness of DEPTEST, we run it on the most extensively used dataset, Defects4J, which claims to have already excluded all irrelevant code changes from each bug fix via manual purification. Our experiment indicates that even in a dataset where the bug fix is claimed to be well isolated, 41.01% of human-written patches can be further reduced by 4.3 lines on average, with the largest reduction reaching 53 lines. This result indicates the potential of DEPTEST to assist in constructing datasets of accurate bug fixes. Furthermore, based on the purified patches, we re-dissect Defects4J and systematically revisit the APR of multi-chunk bugs to provide insights for future research targeting such bugs.
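The delta-debugging step mentioned in the abstract can be illustrated with a minimal sketch. It is not DEPTEST's implementation: it is a generic ddmin-style reduction over a list of code changes, and the names `ddmin`, `still_fixes`, `patch`, and `required` are hypothetical. In a real setting, the oracle `still_fixes` would presumably apply the selected changes to the buggy version and run the triggering tests; here it is replaced by a toy predicate so the example is self-contained.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(changes: Sequence[T], still_fixes: Callable[[List[T]], bool]) -> List[T]:
    """Return a 1-minimal subset of `changes` for which `still_fixes` holds.

    `still_fixes(subset)` is a hypothetical oracle: True when applying only
    `subset` makes the triggering tests pass.
    """
    current = list(changes)
    if not still_fixes(current):
        raise ValueError("the full patch must make the triggering tests pass")

    n = 2  # granularity: number of chunks the patch is split into
    while len(current) >= 2:
        chunk = (len(current) + n - 1) // n  # ceiling division
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False

        # Step 1: try to reduce to a single chunk.
        for subset in subsets:
            if still_fixes(subset):
                current, n, reduced = subset, 2, True
                break

        # Step 2: otherwise try to drop a single chunk (reduce to a complement).
        if not reduced:
            for i in range(len(subsets)):
                complement = [c for j, s in enumerate(subsets) if j != i for c in s]
                if still_fixes(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: otherwise refine the granularity, or stop at single changes.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


# Toy example: a five-change patch in which only two changes matter.
patch = ["fix null check", "rename variable", "reformat whitespace",
         "fix off-by-one", "update comment"]
required = {"fix null check", "fix off-by-one"}  # assumed ground truth for the toy oracle
minimal = ddmin(patch, lambda subset: required <= set(subset))
print(minimal)
```

The result is 1-minimal with respect to the oracle: removing any single remaining change makes `still_fixes` fail. The number of oracle calls matters in practice, since each call would correspond to a build and a test run.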
