Natural language processing in mining unstructured data from software repositories: a review

With the growing popularity of open-source platforms, software data is readily available from hosting platforms and version control systems such as GitHub, CVS, and SVN. More than 80 percent of the data they hold is unstructured. Mining these repositories helps project managers, developers, and businesses gain useful insights. Most of the software artefacts in these repositories are written in natural language, which makes natural language processing (NLP) an important part of mining them for useful results. This paper reviews the application of NLP techniques in the field of Mining Software Repositories (MSR), focusing mainly on sentiment analysis, summarization, traceability, norms mining, and mobile analytics. It presents the major NLP work in this area by surveying research papers published from 2000 to 2018. The paper first describes the major artefacts in software repositories to which NLP techniques have been applied. Next, it presents some popular open-source NLP tools that have been used to perform NLP tasks. It then briefly discusses the state of NLP research in the MSR field. Finally, it lists the challenges in this field, along with pointers for future work, and ends with a conclusion.
