Abstract. The open source movement has made vast quantities of source code freely available online, providing an extremely large dataset for empirical study and potential reuse. A major difficulty in fully exploiting this potential is that the data are currently scattered across competing source code repositories, none of which is structured for empirical analysis or cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, which duplicates effort and limits the scope of their results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from SourceForge, Apache, and java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database. References to entities in the 16,058 external jars are resolved and grouped, so that cross-project usage information can be accessed easily. This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code, to facilitate the sharing of extracted data, and to encourage reuse and repeatability of experiments.
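To make the idea of grouped cross-project references concrete, the following is a minimal sketch, not the SourcererDB schema or its resolution algorithm. It assumes a hypothetical three-table layout (`projects`, `jar_entities`, `uses`) and groups references by fully qualified name, so that the same class shipped in different jar versions counts as one entity. All table names, column names, and rows are invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE jar_entities (
            id  INTEGER PRIMARY KEY,
            jar TEXT NOT NULL,   -- jar file the entity was found in
            fqn TEXT NOT NULL    -- fully qualified name of the entity
        );
        CREATE TABLE uses (
            project_id INTEGER REFERENCES projects(id),
            entity_id  INTEGER REFERENCES jar_entities(id)
        );
        """
    )
    # Illustrative rows only; none of this is data from the repository.
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?)",
        [(1, "proj-a"), (2, "proj-b"), (3, "proj-c")],
    )
    conn.executemany(
        "INSERT INTO jar_entities VALUES (?, ?, ?)",
        [
            (1, "log4j-1.2.8.jar", "org.apache.log4j.Logger"),
            (2, "log4j-1.2.15.jar", "org.apache.log4j.Logger"),
            (3, "junit-3.8.jar", "junit.framework.TestCase"),
        ],
    )
    conn.executemany(
        "INSERT INTO uses VALUES (?, ?)",
        [(1, 1), (2, 2), (3, 2), (1, 3)],
    )
    return conn


def cross_project_usage(conn):
    """Count distinct projects per entity, grouping jar versions by name."""
    return conn.execute(
        """
        SELECT e.fqn, COUNT(DISTINCT u.project_id) AS project_count
        FROM uses AS u
        JOIN jar_entities AS e ON e.id = u.entity_id
        GROUP BY e.fqn
        ORDER BY project_count DESC, e.fqn
        """
    ).fetchall()


if __name__ == "__main__":
    for fqn, count in cross_project_usage(build_demo_db()):
        print(fqn, count)
```

Grouping by name rather than by jar file is the simplifying assumption here: it lets a single query answer "how many projects use this class" even when those projects bundle different versions of the same library.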