Mining Internet-Scale Software Repositories

Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84- roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html.

[1]  Esko Ukkonen,et al.  Approximate String Matching with q-grams and Maximal Matches , 1992, Theor. Comput. Sci..

[2]  Cristina V. Lopes,et al.  Aspect-oriented programming , 1999, ECOOP Workshops.

[3]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Eric Brill,et al.  Some Advances in Transformation-Based Part of Speech Tagging , 1994, AAAI.

[5]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[6]  Wray L. Buntine Open source search: a data mining platform , 2005, SIGF.

[7]  Padhraic Smyth,et al.  Analyzing Entities and Topics in News Articles Using Statistical Topic Models , 2006, ISI.

[8]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[9]  C. Lee Giles,et al.  What's the code?: automatic classification of source code archives , 2002, KDD.

[10]  A. Zeller,et al.  If Your Bug Database Could Talk . . . , 2006 .

[11]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[12]  David J. Newman,et al.  Probabilistic topic decomposition of an eighteenth-century American newspaper , 2006, J. Assoc. Inf. Sci. Technol..

[13]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[14]  Cristina V. Lopes,et al.  Aspect-oriented programming , 1999, ECOOP Workshops.

[15]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[16]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .