Duplicate Pull Request Detection: When Time Matters

In open source communities (e.g., GitHub), developers frequently submit pull requests to fix bugs or add new features during the development process. Because pull requests are submitted in an uncoordinated, distributed manner, extensive duplication results. Usually, only the first pull request approved by reviewers is merged into the main branch of the repository, and the others are regarded as duplicates by maintainers. Duplication substantially increases the workload of project reviewers and maintainers and therefore delays the evolution of open source repositories. To identify duplicate pull requests automatically, Ren et al. proposed a state-of-the-art approach that models a pull request with nine features and determines whether a given request duplicates any existing request. Nevertheless, we notice that their approach overlooks the time factor, which is a significant feature for this task. In this study, we investigate the influence of the time factor and improve the pull request representation. We hypothesize that two pull requests are more likely to be duplicates when their creation times are close to each other. We verify this hypothesis on 26 open source repositories from GitHub with over 100,000 pairs of pull requests. We integrate the time feature with the nine features proposed by Ren et al., and the experimental results show that it substantially improves the performance of Ren et al.'s approach, by 14.36% and 11.93% in terms of F1-score@1 and F1-score@5, respectively.
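
The time feature described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the feature is computed, so the choice of an absolute gap measured in days, the function names (`parse_created_at`, `time_gap_days`, `with_time_feature`), and the placeholder nine-value feature vector are all assumptions made for illustration. The timestamps are assumed to be ISO 8601 UTC strings of the form `2019-03-01T12:00:00Z`.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400.0


def parse_created_at(stamp: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2019-03-01T12:00:00Z'."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def time_gap_days(created_a: str, created_b: str) -> float:
    """Absolute gap, in days, between two pull request creation times.

    Assumption: a smaller gap suggests a higher chance of duplication.
    """
    delta = parse_created_at(created_a) - parse_created_at(created_b)
    return abs(delta.total_seconds()) / SECONDS_PER_DAY


def with_time_feature(base_features, created_a: str, created_b: str):
    """Append the time gap to an existing pairwise feature vector.

    `base_features` stands in for the nine pairwise features of the
    baseline approach; their actual definitions are not given here.
    """
    return list(base_features) + [time_gap_days(created_a, created_b)]


if __name__ == "__main__":
    nine_features = [0.0] * 9  # hypothetical placeholder values
    vector = with_time_feature(
        nine_features, "2019-03-01T12:00:00Z", "2019-03-03T12:00:00Z"
    )
    print(len(vector), vector[-1])
```

The resulting ten-value vector could then be passed to whatever classifier scores candidate pairs. Taking the absolute value makes the feature symmetric in the two pull requests, which is a design choice of this sketch rather than something stated in the text.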
