Exploring Word Embedding Techniques to Improve Sentiment Analysis of Software Engineering Texts

Sentiment analysis (SA) of text-based software artifacts is increasingly used to extract information for various tasks, including providing code suggestions, improving development team productivity, recommending software packages and libraries, and recommending comments on defects in source code, code quality, and possibilities for improvement of applications. Studies of state-of-the-art sentiment analysis tools applied to software-related texts have shown varying results depending on the techniques and training approaches used. In this paper, we investigate the impact of two potential opportunities to improve the training for sentiment analysis of software engineering (SE) artifacts, in the context of neural networks customized using the Stack Overflow data developed by Lin et al. First, we customize the sentiment analysis process to the software domain using software domain-specific word embeddings learned from Stack Overflow (SO) posts, and study their impact on the performance of the sentiment analysis tool compared to generic word embeddings learned from Google News. We find that the word embeddings learned from the Google News data perform mostly similarly to, and in some cases better than, the word embeddings learned from SO posts. Second, we study the impact of two machine learning techniques, oversampling and undersampling of data, on the training of a sentiment classifier for handling small SE datasets with a skewed distribution. We find that oversampling alone, as well as the combination of oversampling and undersampling, helps improve the performance of a sentiment classifier.
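To make the resampling idea concrete, the following is a minimal sketch of random oversampling combined with random undersampling on a skewed sentiment dataset. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `rebalance`, the `target` class size, and the toy sentences are all hypothetical, and the abstract does not say which oversampling variant (for example, random duplication or a synthetic method such as SMOTE) was applied.

```python
import random
from collections import Counter


def rebalance(samples, labels, target, seed=0):
    """Randomly resample every class to exactly `target` examples.

    Hypothetical helper for illustration only; the resampling strategy
    and the target size are assumptions, not details from the paper.
    """
    rng = random.Random(seed)

    # Group the texts by their sentiment label.
    by_class = {}
    for text, label in zip(samples, labels):
        by_class.setdefault(label, []).append(text)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Undersampling: keep a random subset, without replacement.
            chosen = rng.sample(items, target)
        else:
            # Oversampling: keep all originals and add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy, made-up data with a skewed label distribution (6 / 2 / 1).
texts = [
    "How do I parse JSON in Java?",
    "This method returns a list.",
    "You need to set the classpath.",
    "Use a HashMap for that.",
    "The API takes two arguments.",
    "Call close() when you are done.",
    "Great answer, this works perfectly!",
    "Thanks, that library is awesome.",
    "This framework is a nightmare to configure.",
]
sentiments = ["neutral"] * 6 + ["positive"] * 2 + ["negative"]

balanced_texts, balanced_sentiments = rebalance(texts, sentiments, target=3)
print(Counter(sentiments))
print(Counter(balanced_sentiments))
```

In this sketch the majority class is cut down and the minority classes are padded with duplicates, so every class ends up with the same number of training examples. Resampling of this kind would normally be applied to the training split only, so that evaluation still reflects the original skewed distribution.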
