Word Embeddings for the Software Engineering Domain

The software development process produces vast amounts of textual data expressed in natural language. Software engineering research has adapted results from the natural language processing community to leverage this rich textual information, including methods and readily available tools that often ship with pretrained models. State-of-the-art pretrained models, however, capture general, common-sense knowledge and are of limited value when handling data specific to a specialized domain. There is currently a lack of domain-specific pretrained models that would further enhance the processing of natural language artefacts related to software engineering. To this end, we release a word2vec model trained over 15GB of textual data from Stack Overflow posts. We illustrate how the model disambiguates polysemous words by interpreting them within their software engineering context. In addition, we present examples of fine-grained semantics captured by the model; these suggest that the results transfer to diverse, targeted information retrieval tasks in software engineering and motivate further reuse of the model.
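As a rough illustration of how such a model is typically queried, the sketch below ranks words by cosine similarity to a target word. The vectors, the vocabulary, and the helper names `cosine` and `most_similar` are hypothetical and invented for this example; they are not taken from the released model, whose vectors would be loaded from its distribution files instead.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# They are NOT values from the released Stack Overflow model.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "snake": [0.1, 0.9, 0.2],
    "coffee": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


if __name__ == "__main__":
    for neighbour, score in most_similar("python", TOY_VECTORS):
        print(f"{neighbour}\t{score:.3f}")
```

In these toy vectors "python" sits closer to "java" than to "snake", which mimics the kind of domain-specific reading of a polysemous word that the abstract describes; a general-purpose model might rank the neighbours differently.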
