BEARS: An Extensible Java Bug Benchmark for Automatic Program Repair Studies

Benchmarks of bugs are essential to empirically evaluate automatic program repair tools. In this paper, we present BEARS, a project for collecting and storing bugs into an extensible bug benchmark for automatic repair studies in Java. The collection of bugs relies on the commit build state from Continuous Integration (CI) to find potential pairs of buggy and patched program versions from open-source projects hosted on GitHub. Each pair of program versions passes through a pipeline that attempts to reproduce a bug and its patch. The core step of the reproduction pipeline is the execution of the program's test suite on both program versions. If a test failure is found in the buggy program version candidate and no test failure is found in its patched program version candidate, the bug and its patch are successfully reproduced. The uniqueness of BEARS is its use of CI builds, which have been widely adopted by open-source projects in recent years, to identify buggy and patched program version candidates. This approach allows us to collect bugs from a diversity of projects, beyond the mature projects that use bug tracking systems. Moreover, BEARS was designed to be publicly available and to be easily extensible by the research community through the automatic creation of branches containing bugs in a given GitHub repository, which can be used for pull requests to the BEARS repository. We present in this paper the approach employed by BEARS, and we deliver version 1.0 of BEARS, which contains 251 reproducible bugs collected from 72 projects that use Travis CI and the Maven build environment.
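
To make the reproduction criterion concrete, below is a minimal sketch in Java of the pass/fail check at the core of the pipeline. The `TestReport` record and the `isReproduced` method are hypothetical names introduced here for illustration only, not the actual BEARS implementation; the real pipeline additionally checks out and builds each candidate version with Maven and runs its full test suite to obtain these results.

```java
/**
 * A minimal sketch of the BEARS reproduction criterion described in the
 * abstract. TestReport and isReproduced are hypothetical names, not part
 * of the actual BEARS code base.
 */
public class ReproductionCheck {

    /** Outcome of running one program version's test suite (e.g. `mvn test`). */
    record TestReport(int failingTests) {
        boolean hasFailure() { return failingTests > 0; }
    }

    /**
     * A bug and its patch count as reproduced when the buggy candidate has
     * at least one failing test and the patched candidate has none.
     */
    static boolean isReproduced(TestReport buggy, TestReport patched) {
        return buggy.hasFailure() && !patched.hasFailure();
    }

    public static void main(String[] args) {
        // Reproduced: the buggy version fails tests, the patched one passes.
        System.out.println(isReproduced(new TestReport(2), new TestReport(0))); // true
        // Not reproduced: no failing test in the buggy candidate.
        System.out.println(isReproduced(new TestReport(0), new TestReport(0))); // false
    }
}
```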
