Scalable Relevant Project Recommendation on GitHub

GitHub, one of the largest social coding platforms, fosters a flexible and collaborative development process. In practice, developers on this open source platform need to find projects relevant to their own work in order to reuse functionality, explore ideas for possible features, or analyze the requirements of their projects. Recommending relevant projects to a developer is difficult because GitHub hosts millions of projects and different developers have different notions of relevance. In this paper, we propose a scalable and personalized approach that recommends projects by leveraging both developers' behaviors and project features. Based on the features of the projects a developer has created and the developer's behaviors toward other projects, our approach automatically recommends the top-N most relevant software projects to that developer. Moreover, to improve scalability, we implement our approach on a parallel processing framework (i.e., Apache Spark) so that it can analyze large-scale GitHub data and produce recommendations efficiently. We perform an empirical study on data crawled from GitHub, and the results show that our approach can efficiently recommend relevant software projects with relatively high precision with respect to developers' interests.
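
To make the idea of combining project features with developer behaviors concrete, the following is a minimal single-machine sketch, not the authors' implementation. It assumes that project features are bag-of-words vectors built from project descriptions, that behaviors are "star" and "fork" actions with hypothetical weights, and that relevance is cosine similarity between a developer profile and each candidate project. The names `term_vector`, `cosine`, `recommend`, and `BEHAVIOR_WEIGHTS` are illustrative assumptions; the abstract does not specify any of them.

```python
import math
from collections import Counter

# Hypothetical weights: how strongly each behavior signals interest.
# These values are assumptions for illustration, not taken from the paper.
BEHAVIOR_WEIGHTS = {"fork": 1.0, "star": 0.5}


def term_vector(text):
    """Bag-of-words term-frequency vector for a project description."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(own_projects, behaviors, candidates, n=2):
    """Return the names of the top-n candidate projects for one developer.

    own_projects: descriptions of projects the developer created
    behaviors:    (action, description) pairs for projects the developer
                  starred or forked
    candidates:   mapping of project name -> description
    """
    # Build the developer profile from created projects (weight 1.0) ...
    profile = Counter()
    for text in own_projects:
        profile.update(term_vector(text))
    # ... plus behavior-weighted terms from projects the developer acted on.
    for action, text in behaviors:
        weight = BEHAVIOR_WEIGHTS.get(action, 0.0)
        for term, count in term_vector(text).items():
            profile[term] += weight * count

    # Score every candidate, then sort by score (descending) and name (ties).
    scored = [(cosine(profile, term_vector(desc)), name)
              for name, desc in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:n]]


# Usage with made-up data:
top = recommend(
    own_projects=["spark recommender for github projects"],
    behaviors=[("star", "parallel recommender engine")],
    candidates={
        "a/recsys": "github project recommender on spark",
        "b/game": "2d platformer game engine",
        "c/blog": "static blog generator",
    },
    n=2,
)
print(top)
```

The per-candidate scoring step is independent for each project, which is the kind of work a framework such as Apache Spark can distribute across a cluster; this sketch keeps everything in one process for readability.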
