On negative results when using sentiment analysis tools for software engineering research

Recent years have seen increasing attention to the social aspects of software engineering, including studies of the emotions and sentiments experienced and expressed by software developers. Most of these studies reuse existing sentiment analysis tools such as SentiStrength and NLTK. However, these tools have been trained on product reviews and movie reviews, so their results might not be applicable in the software engineering domain. In this paper we study whether the sentiment analysis tools agree with the sentiment recognized by human evaluators (as reported in an earlier study) and with each other. Furthermore, we evaluate the impact of the choice of sentiment analysis tool on software engineering studies by conducting a simple study of differences in issue resolution times for positive, negative and neutral texts. We repeat the study for seven datasets (issue trackers and Stack Overflow questions) and for different sentiment analysis tools, and observe that the disagreement between the tools can lead to diverging conclusions. Finally, we perform two replications of previously published studies and observe that the results of those studies cannot be confirmed when a different sentiment analysis tool is used.
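To make the notion of agreement between tools concrete, the following minimal sketch computes Cohen's kappa, a chance-corrected agreement score, for the labels two tools assign to the same texts. The abstract does not name the agreement measure used in the study, so the choice of unweighted kappa is an assumption made for illustration. The function name `cohen_kappa` and the label lists `tool_a` and `tool_b` are hypothetical and do not come from the paper or from any of the tools mentioned.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long sequences of labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of texts on which both tools give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each tool's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both tools always emit the same single label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six texts, invented for this example.
tool_a = ["pos", "neg", "neu", "neu", "pos", "neg"]
tool_b = ["pos", "neu", "neu", "neu", "neg", "neg"]
kappa = cohen_kappa(tool_a, tool_b)
print(f"kappa = {kappa:.2f}")
```

A value near 1 indicates that the two tools label texts almost identically, while a value near 0 indicates agreement no better than chance. Unweighted kappa treats every disagreement equally; a weighted variant that penalizes positive versus negative more than positive versus neutral may fit ordered sentiment labels better.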
