Mining Developer Behavior Across GitHub and StackOverflow

Software developers are increasingly active on GitHub and StackOverflow, generating a large amount of valuable data in both communities. Researchers mine this data to understand developer behavior, but previous work has mainly focused on a single community. In this paper, we propose a novel approach to mining developer behavior across GitHub and StackOverflow. The approach links accounts from the two communities using a CART decision tree, leveraging features drawn from usernames, user behaviors, and writing styles. It then explores cross-site developer behavior through T-graph analysis, LDA-based topic clustering, and cross-site tagging. We conducted several experiments to evaluate the approach. The results show that the precision and F-score of our identity linkage method are higher than those of previous methods for software communities. In particular, we found that (1) active issue committers are also active question askers; (2) for most developers, the topics of their content on GitHub are similar to those of their questions and answers on StackOverflow; (3) developers’ concerns on StackOverflow shift over time along with the GitHub projects they are currently participating in; (4) developers’ concerns on GitHub are more relevant to their answers than to their questions and comments on StackOverflow.
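To make the identity-linkage step more concrete, the following is a minimal sketch of how username features for a candidate account pair might be computed, and how a CART-style tree would choose its first split by minimizing Gini impurity. The abstract does not specify the actual features or training data, so the feature names (`exact`, `ratio`, `lcs`), the helper functions, and the toy username pairs below are all illustrative assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher


def username_features(gh_name, so_name):
    """Hypothetical username features for one (GitHub, StackOverflow) pair."""
    # Normalize: lowercase and keep only letters and digits.
    a = "".join(ch for ch in gh_name.lower() if ch.isalnum())
    b = "".join(ch for ch in so_name.lower() if ch.isalnum())
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    longest = matcher.find_longest_match(0, len(a), 0, len(b)).size
    return {
        "exact": 1.0 if a == b else 0.0,          # identical after normalization
        "ratio": matcher.ratio(),                  # overall string similarity
        "lcs": longest / max(len(a), len(b), 1),   # longest common substring share
    }


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (feature, threshold, weighted Gini) of the best binary split.

    This is the split-selection step at a single CART node, not a full tree.
    """
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Toy, made-up account pairs: label 1 = same person, 0 = different people.
pairs = [
    ("jsmith", "J.Smith", 1),
    ("octo-cat", "octocat", 1),
    ("alice_dev", "bob42", 0),
    ("kernelhacker", "pyfan", 0),
]
rows = [username_features(gh, so) for gh, so, _ in pairs]
labels = [label for _, _, label in pairs]
split = best_split(rows, labels)
print(split)
```

A full CART classifier would apply `best_split` recursively to each resulting partition and would also draw on the behavior and writing-style features mentioned in the abstract; only the username part is sketched here.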
