Building Language Models for Text with Named Entities

Text in many domains involves a significant amount of named entities. Predicting entity names is often challenging for a language model because they appear infrequently in the training corpus. In this paper, we propose a novel and effective approach to building a discriminative language model that learns entity names by leveraging their entity type information. We also introduce two benchmark datasets, based on recipes and Java programming code, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% better perplexity in recipe generation and 22.06% better perplexity in code generation than state-of-the-art language models.
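The core idea of leveraging entity type information can be illustrated by factoring the probability of a rare entity name through its type: first predict the entity type from context, then predict the name given the type. The toy count-based sketch below is illustrative only and assumes a hypothetical corpus and word-to-type mapping; it is not the paper's neural architecture.

```python
from collections import Counter, defaultdict

class TypeFactoredLM:
    """Toy sketch of a type-aware language model.

    Instead of predicting a rare entity name directly, factor
    P(word | ctx) = P(type(word) | ctx) * P(word | type(word)),
    so rare names share statistics through their entity type.
    """

    def __init__(self, word2type):
        self.word2type = word2type                     # word -> entity type ("O" = no entity)
        self.type_given_prev = defaultdict(Counter)    # counts for P(type | previous word)
        self.word_given_type = defaultdict(Counter)    # counts for P(word | type)

    def train(self, sentences):
        for sent in sentences:
            prev = "<s>"
            for w in sent:
                t = self.word2type.get(w, "O")
                self.type_given_prev[prev][t] += 1
                self.word_given_type[t][w] += 1
                prev = w

    def prob(self, prev, word):
        t = self.word2type.get(word, "O")
        type_counts = self.type_given_prev[prev]
        word_counts = self.word_given_type[t]
        if not type_counts or not word_counts:
            return 0.0
        p_type = type_counts[t] / sum(type_counts.values())
        p_word = word_counts[word] / sum(word_counts.values())
        return p_type * p_word

# Illustrative recipe-style data (not from the paper's datasets):
# both spices occur once, but the SPICE type after "add" is seen twice,
# so each name still receives substantial probability mass.
lm = TypeFactoredLM({"salt": "SPICE", "pepper": "SPICE"})
lm.train([["add", "salt"], ["add", "pepper"]])
```

Here `lm.prob("add", "salt")` evaluates to 0.5: P(SPICE | "add") = 1.0 times P("salt" | SPICE) = 0.5, showing how type-level statistics smooth over individually rare entity names.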
