A Linked Data platform for mining software repositories

Mining software repositories involves extracting both basic and value-added information from existing software repositories. Different stakeholders (e.g., researchers, managers) mine these repositories to extract facts for various purposes. To avoid unnecessary pre-processing and analysis steps, both basic and value-added facts need to be shared and integrated. In this research, we introduce SeCold, an open and collaborative platform for sharing software datasets. SeCold provides the first online software ecosystem Linked Data platform that supports data extraction and on-the-fly inter-dataset integration from major version control, issue tracking, and quality evaluation systems. In its first release, the dataset contains about two billion facts, such as source code statements, software licenses, and code clones, from 18 000 software projects. In its second release, the SeCold project will contain additional facts mined from issue trackers and versioning systems. Our approach is based on the same fundamental principle as Wikipedia: researchers and tool developers share the analysis results obtained from their tools by publishing them as part of the SeCold portal, thereby making them an integrated part of the global knowledge domain. The SeCold project is an official member of the Linked Data dataset cloud and is currently the eighth largest online dataset available on the Web.
