Code and Named Entity Recognition in StackOverflow

There is increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming-related questions written by 8.5 million users. Meanwhile, fundamental NLP techniques are still lacking for identifying code tokens and software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which leads to an absolute increase of 10 F$_1$ points over off-the-shelf BERT. We also present the SoftNER model, which achieves an overall F$_1$ score of 79.10 for code and named entity recognition on StackOverflow data. SoftNER incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model. Our code and data are available at: this https URL
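To make the reported F$_1$ numbers concrete, the following is a minimal sketch of how an entity-level F$_1$ score is commonly computed for a tagger like this one, assuming BIO-style tags and exact span matching. The abstract does not state the tagging scheme or the evaluation script, so both are assumptions here, and the labels `Function` and `Language` are hypothetical stand-ins for the corpus's 20 entity types.

```python
from typing import List, Set, Tuple

# A span is (start index, exclusive end index, entity type).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of typed spans.

    Assumption: an I- tag with no matching open span starts a new span.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and etype == tag[2:]:
            continue  # the current span keeps going
        if etype is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 between gold and predicted tags."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "Use the len function in Python"
gold_tags = ["O", "O", "B-Function", "O", "O", "B-Language"]
pred_tags = ["O", "O", "B-Function", "O", "O", "O"]
score = span_f1(gold_tags, pred_tags)
```

In this toy case the prediction recovers one of the two gold entities, so precision is 1.0 and recall is 0.5. A reported score such as 79.10 would be this kind of quantity aggregated over a whole test set and multiplied by 100.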
