CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues

The design and maintenance of APIs (Application Programming Interfaces) are complex tasks due to the constantly changing requirements of their users. Despite the efforts of their designers, APIs may suffer from a number of issues (such as incomplete or erroneous documentation, poor performance, and backward incompatibility). To maintain a healthy client base, API designers must learn these issues to fix them. Question answering sites, such as Stack Overflow (SO), have become a popular place for discussing API issues. These posts about API issues are invaluable to API designers, not only because they can help to learn more about the problem but also because they can facilitate learning the requirements of API users. However, the unstructured nature of posts and the abundance of non-issue posts make the task of detecting SO posts concerning API issues difficult and challenging. In this paper, we first develop a supervised learning approach using a Conditional Random Field (CRF), a statistical modeling method, to identify API issue-related sentences. We use the above information together with different features collected from posts, the experience of users, readability metrics and centrality measures of collaboration network to build a technique, called CAPS , that can classify SO posts concerning API issues. In total, we consider 34 features along eight different dimensions. Evaluation of CAPS using carefully curated SO posts on three popular API types reveals that the technique outperforms all three baseline approaches we consider in this study. We then conduct studies to find important features and also evaluate the performance of the CRF-based technique for classifying issue sentences. Comparison with two other baseline approaches shows that the technique has high potential. We also test the generalizability of CAPS results, evaluate the effectiveness of different classifiers, and identify the impact of different feature sets.

[1]  Michele Marchesi,et al.  Would you mind fixing this issue? - An Empirical Analysis of Politeness and Attractiveness in Software Developed Using Agile Boards , 2015, XP.

[2]  Michele Lanza,et al.  Harnessing Stack Overflow for the IDE , 2012, 2012 Third International Workshop on Recommendation Systems for Software Engineering (RSSE).

[3]  Benjamin V. Hanrahan,et al.  Modeling problem difficulty and expertise in stackoverflow , 2012, CSCW.

[4]  Bonita Sharif,et al.  Analyzing Developer Sentiment in Commit Logs , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[5]  Markus Neuhäuser,et al.  Wilcoxon-Mann-Whitney Test , 2011, International Encyclopedia of Statistical Science.

[6]  Ruslan Salakhutdinov,et al.  Evaluation methods for topic models , 2009, ICML '09.

[7]  Grzegorz Chrupala,et al.  Predicting the quality of questions on Stackoverflow , 2015, RANLP.

[8]  Ahmed E. Hassan,et al.  What are developers talking about? An analysis of topics and trends in Stack Overflow , 2014, Empirical Software Engineering.

[9]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[10]  Shane McIntosh,et al.  An empirical study of the impact of modern code review practices on software quality , 2015, Empirical Software Engineering.

[11]  M. Coleman,et al.  A computer readability formula designed for machine scoring. , 1975 .

[12]  Scott Grant,et al.  Estimating the Optimal Number of Latent Concepts in Source Code Analysis , 2010, 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation.

[13]  Ingo Scholtes,et al.  Categorizing bugs with social networks: A case study on four open source software communities , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[14]  Chanchal Kumar Roy,et al.  Classifying stack overflow posts on API issues , 2018, 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[15]  Nicole Novielli,et al.  The challenges of sentiment detection in the social programmer ecosystem , 2015, SSE@SIGSOFT FSE.

[16]  Christoph Treude,et al.  How do programmers ask and answer questions on the web?: NIER track , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[17]  Anindya Iqbal,et al.  SentiCR: A customized sentiment analysis tool for code review interactions , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[18]  R. Gunning The Technique of Clear Writing. , 1968 .

[19]  Alexander Serebrenik,et al.  Security and emotion: sentiment analysis of security discussions on GitHub , 2014, MSR 2014.

[20]  Phillip Bonacich,et al.  Eigenvector-like measures of centrality for asymmetric relations , 2001, Soc. Networks.

[21]  Gabriele Bavota,et al.  How do API changes trigger stack overflow discussions? a study on the Android SDK , 2014, ICPC 2014.

[22]  Alexander Serebrenik,et al.  On negative results when using sentiment analysis tools for software engineering research , 2017, Empirical Software Engineering.

[23]  Andrea Esuli,et al.  SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining , 2010, LREC.

[24]  Chanchal Kumar Roy,et al.  Useful, But Usable? Factors Affecting the Usability of APIs , 2011, 2011 18th Working Conference on Reverse Engineering.

[25]  Eleni Stroulia,et al.  On the Personality Traits of StackOverflow Users , 2013, 2013 IEEE International Conference on Software Maintenance.

[26]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[27]  David Lo,et al.  Chaff from the Wheat: Characterizing and Determining Valid Bug Reports , 2020, IEEE Transactions on Software Engineering.

[28]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields , 2010, Found. Trends Mach. Learn..

[29]  Ngoc Phuoc An Vo,et al.  Identifying User Issues and Request Types in Forum Question Posts Based on Discourse Analysis , 2016, WWW.

[30]  Jimmy J. Lin,et al.  What Works Better for Question Answering: Stemming or Morphological Query Expansion? , 2004 .

[31]  Mikko Kurimo,et al.  Part-of-Speech Tagging using Conditional Random Fields: Exploiting Sub-Label Dependencies for Improved Accuracy , 2014, ACL.

[32]  Nicole Novielli,et al.  A Benchmark Study on Sentiment Analysis for Software Engineering Research , 2018, 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR).

[33]  Y. Benjamini,et al.  THE CONTROL OF THE FALSE DISCOVERY RATE IN MULTIPLE TESTING UNDER DEPENDENCY , 2001 .

[34]  Preethi Raghavan,et al.  Extracting Problem and Resolution Information from Online Discussion Forums , 2010, COMAD.

[35]  Martin P. Robillard,et al.  Discovering Information Explaining API Types Using Text Classification , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[36]  Yang Li,et al.  Sentiment analysis of commit comments in GitHub: an empirical study , 2014, MSR 2014.

[37]  Thomas Zimmermann,et al.  What Makes a Good Bug Report? , 2010, IEEE Trans. Software Eng..

[38]  Minhaz Fahim Zibran,et al.  Leveraging Automated Sentiment Analysis in Software Engineering , 2017, 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR).

[39]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[40]  Harald C. Gall,et al.  How can i improve my app? Classifying user reviews for software maintenance and evolution , 2015, 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[41]  Zhenchang Xing,et al.  Mining Analogical Libraries in Q&A Discussions -- Incorporating Relational and Categorical Knowledge into Word Embedding , 2016, 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[42]  Nicole Novielli,et al.  Sentiment Polarity Detection for Software Development , 2017, Empirical Software Engineering.

[43]  Paul D. Allison,et al.  Logistic regression using sas®: theory and application , 1999 .

[44]  Michael W. Godfrey,et al.  Recommending Posts concerning API Issues in Developer Q&A Sites , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[45]  Ali Mesbah,et al.  Mining questions asked by web developers , 2014, MSR 2014.

[46]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[47]  Martin P. Robillard,et al.  What Makes APIs Hard to Learn? Answers from Developers , 2009, IEEE Software.

[48]  Ashish Sureka,et al.  Chaff from the wheat: characterization and modeling of deleted questions on stack overflow , 2014, WWW.

[49]  Michele Lanza,et al.  Improving Low Quality Stack Overflow Post Detection , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[50]  R. Flesch A new readability yardstick. , 1948, The Journal of applied psychology.

[51]  Alberto Bacchelli,et al.  Content classification of development emails , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[52]  Ashish Sureka,et al.  Fit or unfit: analysis and prediction of 'closed questions' on stack overflow , 2013, COSN '13.

[53]  Lin Li,et al.  Obstacles in Using Frameworks and APIs: An Exploratory Study of Programmers' Newsgroup Discussions , 2011, 2011 IEEE 19th International Conference on Program Comprehension.

[54]  G. Harry McLaughlin,et al.  SMOG Grading - A New Readability Formula. , 1969 .

[55]  Michael W. Godfrey,et al.  Detecting API usage obstacles: A study of iOS and Android developer questions , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[56]  Marcelo Serrano Zanetti,et al.  The Role of Emotions in Contributors Activity: A Case Study on the GENTOO Community , 2013, 2013 International Conference on Cloud and Green Computing.

[57]  Gabriele Bavota,et al.  Mining StackOverflow to turn the IDE into a self-confident programming prompter , 2014, MSR 2014.

[58]  R. P. Fishburne,et al.  Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel , 1975 .

[59]  Martin P. Robillard,et al.  A field study of API learning obstacles , 2011, Empirical Software Engineering.

[60]  ChengXiang Zhai,et al.  Learning online discussion structures by conditional random fields , 2011, SIGIR.

[61]  Eleni Stroulia,et al.  Detecting duplicate bug reports with software engineering domain knowledge , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[62]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[63]  Xiaoyan Zhu,et al.  Using Conditional Random Fields to Extract Contexts and Answers of Questions from Online Forums , 2008, ACL.

[64]  Romain Robbes,et al.  How do developers react to API deprecation?: the case of a smalltalk ecosystem , 2012, SIGSOFT FSE.

[65]  Yingying Zhang,et al.  Extracting problematic API features from forum discussions , 2013, 2013 21st International Conference on Program Comprehension (ICPC).

[66]  Foutse Khomh,et al.  Automatic summarization of API reviews , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[67]  Chanchal Kumar Roy,et al.  Answering questions about unanswered questions of Stack Overflow , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).