Software-Specific Named Entity Recognition in Software Engineering Social Content

Software engineering social content, such as Q&A discussions on Stack Overflow, has become a rich source of information on software engineering. This textual content centers on software-specific entities and their usage patterns, problems and solutions, and alternatives. However, existing approaches to analyzing software engineering texts treat software-specific entities in the same way as other content, and thus cannot support recent advances in entity-centric applications, such as direct answers and knowledge graphs. The first step towards enabling these entity-centric applications for software engineering is to recognize and classify software-specific entities, a task referred to as Named Entity Recognition (NER) in the literature. Existing NER methods are designed to recognize persons, locations, and organizations in formal and social texts, and are not applicable to NER in software engineering. Existing information extraction methods for software engineering are limited to API identification and linking for a particular programming language. In this paper, we formulate the research problem of NER in software engineering. We identify the challenges in designing a software-specific NER system and propose a machine learning based approach applied to software engineering social content. Our NER system, called S-NER, is general for software engineering in that it can recognize a broad category of software entities for a wide range of popular programming languages, platforms, and libraries. We conduct systematic experiments to evaluate our machine learning based S-NER against a well-designed baseline system, and to study the effectiveness of widely-adopted NER techniques and features in the face of the unique characteristics of software engineering social content.
