Detecting Duplicate Pull-requests in GitHub

The widespread use of pull-requests boosts the development and evolution for many open source software projects. However, due to the parallel and uncoordinated nature of development process in GitHub, duplicate pull-requests may be submitted by different contributors to solve the same problem. Duplicate pull-requests increase the maintenance cost of GitHub, result in the waste of time spent on the redundant effort of code review, and even frustrate developers' willing to offer continuous contribution. In this paper, we investigate using text information to automatically detect duplicate pull-requests in GitHub. For a new-arriving pull-request, we compare the textual similarity between it and other existing pull-requests, and then return a candidate list of the most similar ones. We evaluate our approach on three popular projects hosted in GitHub, namely Rails, Elasticsearch and Angular.JS. The evaluation shows that about 55.3% -- 71.0% of the duplicates can be found when we use the combination of title similarity and description similarity.

[1]  Michael W. Godfrey,et al.  Investigating technical and non-technical factors influencing modern code review , 2015, Empirical Software Engineering.

[2]  James D. Herbsleb,et al.  Influence of social and technical factors for evaluating contribution in GitHub , 2014, ICSE.

[3]  Gang Yin,et al.  Determinants of pull-based development in the context of continuous integration , 2016, Science China Information Sciences.

[4]  Siau-Cheng Khoo,et al.  A discriminative model approach for accurate duplicate bug report retrieval , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[5]  Georgios Gousios,et al.  Work Practices and Challenges in Pull-Based Development: The Integrator's Perspective , 2014, ICSE.

[6]  Siau-Cheng Khoo,et al.  Towards more accurate retrieval of duplicate bug reports , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[7]  Hajimu Iida,et al.  Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[8]  Michael E. Fagan Design and Code Inspections to Reduce Errors in Program Development , 1976, IBM Syst. J..

[9]  Tao Xie,et al.  An approach to detecting duplicate bug reports using natural language and execution information , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[10]  Gang Yin,et al.  Reviewer recommendation for pull-requests in GitHub: What can we learn from code review and bug assignment? , 2016, Inf. Softw. Technol..

[11]  Arie van Deursen,et al.  An exploratory study of the pull-based software development model , 2014, ICSE.

[12]  Bonita Sharif,et al.  Improving the accuracy of duplicate bug report detection using textual similarity measures , 2014, MSR 2014.

[13]  Per Runeson,et al.  Detection of Duplicate Defect Reports Using Natural Language Processing , 2007, 29th International Conference on Software Engineering (ICSE'07).

[14]  James D. Herbsleb,et al.  Let's talk about it: evaluating contributions through discussion in GitHub , 2014, SIGSOFT FSE.

[15]  Georgios Gousios,et al.  Automatically Prioritizing Pull Requests , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[16]  Georgios Gousios,et al.  Work practices and challenges in pull-based development: the contributor's perspective , 2015, ICSE.

[17]  Georgios Gousios,et al.  Chapter 9 – A Mixed Methods Approach to Mining Code Review Data: Examples and a Study of Multicommit Reviews and Pull Requests , 2015 .

[18]  David Lo,et al.  Duplicate bug report detection with a combination of information retrieval and topic modeling , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[19]  Daniel M. Germán,et al.  Cohesive and Isolated Development with Branches , 2012, FASE.

[20]  Haiyi Zhu,et al.  Effectiveness of Conflict Management Strategies in Peer Review Process of Online Collaboration Projects , 2016, CSCW.

[21]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[22]  Margaret-Anne D. Storey,et al.  Understanding broadcast based peer review on open source software projects , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[23]  Gang Yin,et al.  Reviewer Recommender of Pull-Requests in GitHub , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[24]  Gang Yin,et al.  Automatic Classification of Review Comments in Pull-based Development Model , 2017, SEKE.

[25]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[26]  Jia-Huan He,et al.  CoreDevRec: Automatic Core Member Recommendation for Contribution Evaluation , 2015, Journal of Computer Science and Technology.

[27]  Alberto Bacchelli,et al.  Expectations, outcomes, and challenges of modern code review , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[28]  David Lo,et al.  Multi-Factor Duplicate Question Detection in Stack Overflow , 2015, Journal of Computer Science and Technology.