An empirical study on developer interactions in StackOverflow

StackOverflow provides a popular platform where developers post and answer questions. Recently, Treude et al. manually label 385 questions in StackOverflow and group them into 10 categories based on their contents. They also analyze how tags are used in StackOverflow. In this study, we extend their work to obtain a deeper understanding on how developers interact with one another on such a question and answer web site. First, we analyze the distributions of developers who ask and answer questions. We also investigate if there is a segregation of the StackOverflow community into questioners and answerers. We also perform automated text mining to find the various kinds of topics asked by developers. We use Latent Dirichlet Allocation (LDA), a well known topic modeling approach, to analyze the contents of tens of thousands of questions and answers, and produce five topics. Our topic modeling strategy provides an alternative perspective different from that of Treude et al. for categorizing StackOverflow questions. Each question can now be categorized into several topics with different probabilities, and the learned topic model could automatically assign a new question to several categories with varying probabilities. Last but not least, we show the distributions of questions and developers belonging to various topics generated by LDA.

[1]  Christoph Treude,et al.  How do programmers ask and answer questions on the web?: NIER track , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[2]  Walid Maalej,et al.  How do developers blog?: an exploratory study , 2011, MSR '11.

[3]  Gail C. Murphy,et al.  Automatic categorization of bug reports using latent Dirichlet allocation , 2012, ISEC.

[4]  Stephen E. Fienberg,et al.  Discriminative Topic Modeling Based on Manifold Learning , 2012, ACM Trans. Knowl. Discov. Data.

[5]  James D. Herbsleb,et al.  Social coding in GitHub: transparency and collaboration in an open software repository , 2012, CSCW.

[6]  Michele Lanza,et al.  Harnessing Stack Overflow for the IDE , 2012, 2012 Third International Workshop on Recommendation Systems for Software Engineering (RSSE).

[7]  Ahmed E. Hassan,et al.  Explaining software defects using topic models , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[8]  David Lo,et al.  What does software engineering community microblog about? , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[9]  Mira Mezini,et al.  Semi-automatically extracting FAQs to improve accessibility of software development knowledge , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[10]  Zhenchang Xing,et al.  Concern Localization using Information Retrieval: An Empirical Study on Linux Kernel , 2011, 2011 18th Working Conference on Reverse Engineering.

[11]  Deng Cai,et al.  Probabilistic dyadic data analysis with local and global consistency , 2009, ICML '09.

[12]  Daniel M. German,et al.  Towards understanding twitter use in software engineering: preliminary findings, ongoing challenges and future questions , 2011, Web2SE '11.

[13]  Bhavana S. Pansare,et al.  Information Needs in Bug Reports : Improving Cooperation between Developers and Users , 2015 .

[14]  David Lo,et al.  Inferring semantically related software terms and their taxonomy by leveraging collaborative tagging , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[15]  Hongfei Yan,et al.  Comparing Twitter and Traditional Media Using Topic Models , 2011, ECIR.

[16]  Richard N. Taylor,et al.  Software traceability with topic modeling , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[17]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[18]  Thomas L. Griffiths,et al.  The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies , 2007, JACM.