Software Analysis with Unsupervised Topic Models

We provide an overview of our work applying unsupervised topic and author-topic models based on Latent Dirichlet Allocation (LDA) to the problem of mining large software repositories at multiple levels of granularity. Our approaches automatically discover the topics embedded in code and extract document-topic and author-topic distributions. In addition to serving as a convenient summary of program content and developer activities, these and related distributions provide a statistical and information-theoretic basis for quantifying and analyzing software complexity, developer similarity, and the evolution of software over the release timeline.
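
To make the information-theoretic use of these distributions concrete, the following minimal sketch computes two standard quantities from topic distributions: Shannon entropy of a document-topic distribution (one possible proxy for how many concerns a file mixes) and Jensen-Shannon divergence between author-topic distributions (one possible measure of developer dissimilarity). The choice of these two measures, the function names, and all numbers below are illustrative assumptions and are not taken from the work summarized above.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical document-topic distributions over three topics.
document_topics = {
    "Parser.java": [1.0, 0.0, 0.0],        # a file devoted to a single topic
    "Util.java": [1 / 3, 1 / 3, 1 / 3],    # a file spread evenly over all topics
}

# Hypothetical author-topic distributions over the same three topics.
author_topics = {
    "alice": [0.7, 0.2, 0.1],
    "bob": [0.6, 0.3, 0.1],
    "carol": [0.1, 0.1, 0.8],
}

for name, dist in document_topics.items():
    print(f"{name}: entropy = {entropy(dist):.3f} bits")

for a, b in [("alice", "bob"), ("alice", "carol")]:
    d = js_divergence(author_topics[a], author_topics[b])
    print(f"JS({a}, {b}) = {d:.3f} bits")
```

Under these assumptions, a file concentrated on one topic has zero entropy and an evenly spread file has the maximum, while authors with similar topic profiles have a smaller divergence than authors who work on different topics.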