Identifying Redundancies in Fork-based Development

Fork-based development is popular and easy to use, but as the number of forks grows it becomes difficult to maintain an overview of the whole community. This can lead to redundant development, where multiple developers solve the same problem in parallel without being aware of each other. Redundant development wastes the effort of both maintainers and developers. In this paper, we design an approach that identifies redundant code changes in forks as early as possible by extracting clues that indicate similarities between code changes and building a machine learning model to predict redundancies. We evaluate its effectiveness from both the maintainer's and the developer's perspectives. The results show that we achieve 57–83% precision in detecting duplicate code changes from the maintainer's perspective, and that we could save developers 1.9–3.0 commits of effort on average. We also show that our approach significantly outperforms the existing state of the art.
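
The general idea of turning similarity clues between two code changes into a redundancy prediction can be illustrated with a minimal sketch. The abstract does not list the clues or the model, so everything here is an assumption made for illustration: the `CodeChange` record, the two clues (title-token overlap and changed-file overlap), and the fixed `WEIGHTS`, which merely stand in for a trained classifier.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CodeChange:
    """Hypothetical record for a code change (e.g. a pull request)."""
    title: str
    files: set = field(default_factory=set)


def tokens(text):
    """Lowercase alphanumeric word tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def clues(x, y):
    """Illustrative similarity clues between two code changes."""
    return {
        "title_sim": jaccard(tokens(x.title), tokens(y.title)),
        "file_overlap": jaccard(x.files, y.files),
    }


# Assumed, hand-picked weights standing in for a learned model.
WEIGHTS = {"title_sim": 0.5, "file_overlap": 0.5}


def redundancy_score(x, y):
    """Weighted sum of the clues; higher means more likely redundant."""
    return sum(WEIGHTS[name] * value for name, value in clues(x, y).items())


if __name__ == "__main__":
    a = CodeChange("Fix crash in config parser", {"src/config.py"})
    b = CodeChange("Fix config parser crash",
                   {"src/config.py", "tests/test_config.py"})
    c = CodeChange("Add dark mode", {"ui/theme.css"})
    print(redundancy_score(a, b))  # similar pair
    print(redundancy_score(a, c))  # unrelated pair
```

In a realistic setting the clue vectors would be fed to a classifier trained on labeled duplicate and non-duplicate pairs rather than combined with fixed weights.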