An automated approach to assess the similarity of GitHub repositories

Open source software (OSS) allows developers to study, change, and improve code free of charge. Many high-quality OSS projects deliver stable and well-documented products. Most OSS forges sustain active user and expert communities, which in turn provide decent levels of support, both in answering user questions and in repairing reported software bugs. Code reuse is an intrinsic feature of OSS: developing a new system by leveraging existing open source components can reduce development effort, benefiting at least two phases of the software life cycle, namely implementation and maintenance. However, to improve software quality, it is essential to develop a system by learning from well-defined, mature projects. The ability to find similar projects that can support ongoing development activities is therefore of high importance. In this paper, we address the issue of mining open source software repositories to detect similar projects, which can eventually be reused by developers. We propose CrossSim, a novel approach to model the OSS ecosystem and to compute similarities among software projects. An evaluation on a dataset collected from GitHub shows that the proposed approach outperforms three well-established baselines.
