A fine-grained data set and analysis of tangling in bug fixing commits

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they study not only bugs but also other concerns that are irrelevant to the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and of the types of changes that are tangled within bug fixing commits. Methods: We use a crowdsourcing approach to manual labeling in which participants validate, for each line in a bug fixing commit, whether the change contributes to the bug fix. Each line is labeled by four participants; if at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that, depending on the use case, 3% to 47% of the data is noisy without manual untangling. Conclusion: Tangled commits are highly prevalent in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptical and assume that unvalidated data is likely very noisy until proven otherwise.
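The consensus rule described in the Methods can be sketched as follows. This is an illustrative sketch only; the function name and label representation are assumptions, not taken from the study's artifacts.

```python
from collections import Counter

def consensus(labels):
    """Return the consensus label if at least three of the four
    participants agree on it, otherwise None (active disagreement).

    `labels` is the list of four labels one line received, e.g.
    ["bugfix", "bugfix", "bugfix", "test"].
    """
    assert len(labels) == 4, "each line is labeled by four participants"
    # most_common(1) yields the (label, count) pair with the highest count
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 3 else None
```

Lines for which `consensus` returns `None` correspond to the roughly 11% of hard-to-label lines with active disagreement reported in the Results.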

Christoph Treude | Helge Spieker | Alexander Serebrenik | Taher Ahmed Ghaleb | Kuljit Kaur Chahal | Simon Eismann | Ivano Malavolta | Valentina Lenarduzzi | Johannes Erbel | Roberto Verdecchia | Ella Albrecht | Shangwen Wang | Alexander Trautsch | Steffen Herbold | Benjamin Ledel | Idan Amit | Philip Makedonski | Marvin Wyrich | Anna-Katharina Wickert | Debasish Chakroborti | Diego Marcilio | Ricardo Colomo-Palacios | Vijay Walunj | Matej Madeja | Alireza Aghamohammadi | Ivan Pashchenko | James Davis | Austin Z. Henley | Bhaveet Nagaria | Daniel Strüber | Burak Turhan | Omar Alam | Ethem Utku Aktas | Abdullah Aldaeej | Tim Bossenmaier | Matin Nili Ahmadabadi | Kristóf Szabados | Nathaniel Hoy | Gema Rodríguez-Pérez | Paramvir Singh | Yihao Qin | Willard Davis | Hongjun Wu | Matúš Sulír | Fatemeh Fard | Stratos Kourtzanidis | Eray Tuzun | Simin Maleki Shamasbi
