BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes

Fault detection, localization, and repair methods are vital to software quality, but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably reproducible faults and fixes are essential to sound experimental evaluation of approaches to software quality, yet they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like TRAVIS-CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers provide substantial assurance that builds and tests will remain durably reproducible. Several obstacles, however, must be overcome to make this a practical reality. We describe BUGSWARM, a toolset that navigates these obstacles to enable the creation of a scalable, diverse, realistic, continuously growing set of durably reproducible failing and passing versions of real-world, open-source systems. The BUGSWARM toolkit has already gathered 3,091 fail-pass pairs, in Java and Python, all packaged within fully reproducible containers. Furthermore, the toolkit can be run periodically to detect new fail-pass activity, thus growing the dataset continually.
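The mining step the abstract alludes to, spotting a failing CI run that is immediately followed by a passing run of the same project, can be illustrated with a short sketch. The Python example below is only an illustration under assumed data: the `Build` fields and the Travis-style status strings are hypothetical, and this is not BugSwarm's actual schema or implementation.

```python
# Minimal sketch (not the BugSwarm implementation): given a project's CI build
# history in chronological order, flag "fail-pass pairs" -- a failed build
# immediately followed by a passing build on the same branch. Field names
# (build_id, branch, status) are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Build:
    build_id: int
    branch: str
    status: str  # "passed" or "failed", mirroring Travis CI terminology


def find_fail_pass_pairs(builds: List[Build]) -> List[Tuple[Build, Build]]:
    """Return (failed_build, passing_build) pairs of consecutive builds per branch."""
    pairs: List[Tuple[Build, Build]] = []
    last_on_branch: Dict[str, Build] = {}
    for build in builds:  # builds are assumed ordered oldest -> newest
        prev = last_on_branch.get(build.branch)
        if prev is not None and prev.status == "failed" and build.status == "passed":
            pairs.append((prev, build))
        last_on_branch[build.branch] = build
    return pairs


if __name__ == "__main__":
    history = [
        Build(101, "master", "passed"),
        Build(102, "master", "failed"),
        Build(103, "master", "passed"),  # 102 -> 103 forms a fail-pass pair
    ]
    for failed, fixed in find_fail_pass_pairs(history):
        print(f"fail-pass pair: build {failed.build_id} -> build {fixed.build_id}")
```

In the actual pipeline, each such pair would then be archived together with its CI configuration inside a container image so the failing and fixed versions can be rebuilt and re-tested later.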
