Detecting similar repositories on GitHub

GitHub contains millions of repositories, many of which are similar to one another (i.e., they have similar source code or implement similar functionality). Finding similar repositories on GitHub helps software engineers reuse source code, build prototypes, identify alternative implementations, explore related projects, find projects to contribute to, and discover code theft and plagiarism. Previous studies have proposed techniques to detect similar applications by analyzing API usage patterns and software tags. However, these studies either use only a limited source of information or rely on information that is not available for projects on GitHub. In this paper, we propose a novel approach that can effectively detect similar repositories on GitHub. Our approach is based on three heuristics that leverage two data sources (i.e., GitHub stars and readme files) not considered in previous work. The three heuristics are: (1) repositories whose readme files contain similar content are likely to be similar to one another; (2) repositories starred by users with similar interests are likely to be similar; and (3) repositories starred together within a short period of time by the same user are likely to be similar. Based on these three heuristics, we compute three relevance scores (i.e., readme-based relevance, stargazer-based relevance, and time-based relevance) to assess the similarity between two repositories. By integrating the three relevance scores, we build a recommendation system called RepoPal to detect similar repositories. We compare RepoPal to CLAN, a prior state-of-the-art approach, using one thousand Java repositories on GitHub. Our empirical evaluation demonstrates that RepoPal achieves a higher success rate, precision, and confidence than CLAN.
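
The three relevance scores can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration and is not RepoPal's actual definition: readme-based relevance is taken as cosine similarity over bag-of-words vectors, stargazer-based relevance as the Jaccard overlap of the two stargazer sets, time-based relevance as the mean of 1 / (1 + gap in days) over users who starred both repositories, and the integration step as a plain average. The function names and the `{"readme": ..., "stars": ...}` input layout are hypothetical.

```python
import math
from collections import Counter


def readme_relevance(readme_a: str, readme_b: str) -> float:
    """Cosine similarity of bag-of-words vectors (assumed stand-in for heuristic 1)."""
    vec_a = Counter(readme_a.lower().split())
    vec_b = Counter(readme_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stargazer_relevance(stars_a: dict, stars_b: dict) -> float:
    """Jaccard overlap of stargazer sets (assumed stand-in for heuristic 2).

    Each argument maps a user name to the day on which that user starred
    the repository; only the keys are used here.
    """
    users_a, users_b = set(stars_a), set(stars_b)
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def time_relevance(stars_a: dict, stars_b: dict) -> float:
    """Mean of 1 / (1 + gap in days) over shared stargazers (assumed, heuristic 3)."""
    shared = set(stars_a) & set(stars_b)
    if not shared:
        return 0.0
    total = sum(1.0 / (1.0 + abs(stars_a[u] - stars_b[u])) for u in shared)
    return total / len(shared)


def combined_relevance(repo_a: dict, repo_b: dict) -> float:
    """Plain average of the three scores (assumed integration rule)."""
    scores = (
        readme_relevance(repo_a["readme"], repo_b["readme"]),
        stargazer_relevance(repo_a["stars"], repo_b["stars"]),
        time_relevance(repo_a["stars"], repo_b["stars"]),
    )
    return sum(scores) / len(scores)
```

With this layout, a candidate list for a query repository could be obtained by sorting the other repositories by `combined_relevance` in descending order. A weighted combination or a product of the scores would be equally plausible choices under the same assumptions.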
