Automated Patch Correctness Assessment: How Far are We?

Test-based automated program repair (APR) has attracted considerable attention from both industry and academia. Despite the significant progress made in recent studies, the overfitting problem (i.e., a generated patch is plausible, passing the available tests, yet still incorrect) remains a major and long-standing challenge. Many techniques have therefore been proposed to assess the correctness of patches, either in the patch generation phase or in the evaluation of APR techniques. However, the effectiveness of existing techniques has not been systematically compared, and little is known about their advantages and disadvantages. To fill this gap, we performed a large-scale empirical study. Specifically, we systematically investigated the effectiveness of existing automated patch correctness assessment techniques, both static and dynamic, based on 902 patches automatically generated by 21 APR tools from 4 different categories. Our empirical study revealed the following major findings: (1) static code features with respect to patch syntax and semantics are generally effective in differentiating overfitting patches from correct ones; (2) dynamic techniques can generally achieve high precision, while heuristics based on static code features are more effective with respect to recall; (3) existing techniques are more effective for certain projects and types of APR techniques and less effective for others; (4) existing techniques are highly complementary to each other. For instance, a single technique can detect at most 53.5% of the overfitting patches, while 93.3% of them can be detected by at least one technique when the oracle information is available. Based on our findings, we designed an integration strategy that first integrates static code features via learning and then combines the result with other techniques through majority voting. Our experiments show that this strategy significantly enhances the performance of existing patch correctness assessment techniques.
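
The combination step and the complementarity measure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the technique names, the toy verdicts, the strict-majority tie rule, and the helper names `majority_vote` and `union_recall` are all assumptions made for illustration, and the learned static-feature classifier is represented only by its precomputed verdicts.

```python
from typing import Dict, List


def majority_vote(verdicts: List[bool]) -> bool:
    """Return True (overfitting) if strictly more than half of the verdicts say so.

    Ties count as "not overfitting"; this tie rule is an assumption.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) * 2 > len(verdicts)


def union_recall(predictions: Dict[str, List[bool]],
                 is_overfitting: List[bool]) -> float:
    """Fraction of truly overfitting patches flagged by at least one technique."""
    overfit_idx = [i for i, label in enumerate(is_overfitting) if label]
    if not overfit_idx:
        return 0.0
    caught = sum(
        1 for i in overfit_idx
        if any(verdict[i] for verdict in predictions.values())
    )
    return caught / len(overfit_idx)


# Hypothetical verdicts for four patches (True = flagged as overfitting).
predictions = {
    "learned_static": [True, False, True, False],  # stand-in for the learned model
    "dynamic_a": [True, False, False, False],
    "dynamic_b": [False, True, True, False],
}
is_overfitting = [True, True, True, False]  # hypothetical ground truth

# Combine the per-technique verdicts patch by patch.
combined = [
    majority_vote([verdict[i] for verdict in predictions.values()])
    for i in range(len(is_overfitting))
]
coverage = union_recall(predictions, is_overfitting)
```

In this toy data, every overfitting patch is flagged by at least one technique, yet majority voting misses the second patch because only one technique flags it. This is the usual trade-off of voting: it tends to favour precision over the recall of the union.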
