Recognizing Software Bug-Specific Named Entity in Software Bug Repository

Software bugs are unavoidable in software development and maintenance. To manage bugs effectively, bug tracking systems record, manage, and track the bugs of each project. The rich information in a bug repository makes it possible to build entity-centric knowledge bases that help developers understand and fix bugs. However, existing named entity recognition (NER) systems target text that is structured, formal, and well written, with sound grammar and few spelling errors, so they cannot be applied directly to bug-specific named entity recognition. Bug data is free-form text written in a mixed language studded with code, abbreviations, and software-specific vocabulary. In this paper, we summarize the characteristics of bug entities, propose a classification method for bug entities, and build a baseline corpus on two open source projects (Mozilla and Eclipse). On this basis, we propose an approach for bug-specific entity recognition called BNER, which uses the Conditional Random Fields (CRF) model and word embedding techniques. An empirical study evaluates the accuracy of BNER, and the results show that the two baseline corpora are suitable for bug-specific named entity recognition and that BNER is effective for cross-project NER.
