Software-specific part-of-speech tagging: an experimental study on stack overflow

Part-of-speech (POS) tagging performance degrades on out-of-domain data due to the lack of domain knowledge. Software engineering knowledge, embodied in textual documentations, bug reports and online forum discussions, is expressed in natural language, but is full of domain terms, software entities and software-specific informal languages. Such software texts call for software-specific POS tagging. In the software engineering community, there have been several attempts leveraging POS tagging technique to help solve software engineering tasks. However, little work is done for POS tagging on software natural language texts. In this paper, we build a software-specific POS tagger, called S-POS, for processing the textual discussions on Stack Overflow. We target at Stack Overflow because it has become an important developer-generated knowledge repository for software engineering. We define a POS tagset that is suitable for describing software engineering knowledge, select corpus, develop a custom tokenizer, annotate data, design features for supervised model training, and demonstrate that the tagging accuracy of S-POS outperforms that of the Stanford POS Tagger when tagging software texts. Our work presents a feasible roadmap to build software-specific POS tagger for the socio-professional contents on Stack Overflow, and reveals challenges and opportunities for advanced software-specific information extraction.

[1]  Helmut Schmid,et al.  Improvements in Part-of-Speech Tagging with an Application to German , 1999 .

[2]  Yan Zhou,et al.  Enhancing Supervised Learning with Unlabeled Data , 2000, ICML.

[3]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[4]  Zarinah Mohd Kasirun,et al.  Why so complicated? Simple term filtering and weighting for location-based bug report assignment recommendation , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[5]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[6]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[7]  Krisztian Balog,et al.  Entity linking and retrieval for semantic search , 2014, WSDM.

[8]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[9]  Brendan T. O'Connor,et al.  Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters , 2013, NAACL.

[10]  Mark Hepple,et al.  Independence and Commitment: Assumptions for Rapid Training and Execution of Rule-based POS Taggers , 2000, ACL.

[11]  Qun Liu,et al.  Word Lattice Reranking for Chinese Word Segmentation and Part-of-Speech Tagging , 2008, COLING.

[12]  Andrea De Lucia,et al.  Improving IR‐based traceability recovery via noun‐based indexing of software artifacts , 2013, J. Softw. Evol. Process..

[13]  Christoph Treude,et al.  Crowd Documentation : Exploring the Coverage and the Dynamics of API Discussions on Stack Overflow , 2012 .

[14]  David W. Binkley,et al.  Improving identifier informativeness using part of speech information , 2011, MSR '11.

[15]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[16]  Brendan T. O'Connor,et al.  Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments , 2010, ACL.

[17]  Jeffrey C. Carver,et al.  Part-of-speech tagging of program identifiers for improved text-based software engineering tools , 2013, 2013 21st International Conference on Program Comprehension (ICPC).

[18]  David Lo,et al.  A comparative study on the effectiveness of part-of-speech tagging techniques on bug reports , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[19]  Paolo Tonella,et al.  Natural Language Parsing of Program Element Names for Concept Extraction , 2010, 2010 IEEE 18th International Conference on Program Comprehension.

[20]  Teresa Lynn,et al.  Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets , 2015, NUT@IJCNLP.

[21]  Qun Liu,et al.  A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging , 2008, ACL.

[22]  Clémentine Nebut,et al.  Automatic Extraction of a WordNet-Like Identifier Network from Software , 2010, 2010 IEEE 18th International Conference on Program Comprehension.

[23]  Oren Etzioni,et al.  Named Entity Recognition in Tweets: An Experimental Study , 2011, EMNLP.

[24]  Lena Mamykina,et al.  Design lessons from the fastest q&a site in the west , 2011, CHI.