The latent words language model

We present a new generative model of natural language, the latent words language model. The model introduces a latent variable for every word in a text; this variable represents words that are synonymous with or related to the observed word in its context. We develop novel methods to train this model and to infer the expected values of the latent variables for unseen text. The learned word similarities help to reduce the sparseness problems of traditional n-gram language models. We show that the model significantly outperforms interpolated Kneser-Ney smoothing and class-based language models on three different corpora. Furthermore, the latent variables are useful features for information extraction. We show that for both semantic role labeling and word sense disambiguation, the performance of a supervised classifier improves when these variables are incorporated as extra features. This improvement is especially large when only a small annotated corpus is used for training.
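To make the idea concrete, here is a minimal sketch of how such a model can assign word probabilities, written as a marginalization over the latent word at each position. The symbol h_i for the latent word and the exact n-gram conditioning context are notational assumptions introduced here for illustration; the paper's precise formulation and training procedure may differ.

\[
P\bigl(w_i \mid w_{i-n+1}^{\,i-1}\bigr) \;=\; \sum_{h_i} P\bigl(w_i \mid h_i\bigr)\, P\bigl(h_i \mid h_{i-n+1}^{\,i-1}\bigr)
\]

Here \(P(w_i \mid h_i)\) captures which observed words can stand in for the latent word \(h_i\) (synonyms and related words in context), while \(P(h_i \mid h_{i-n+1}^{\,i-1})\) is an n-gram model over latent words; summing over \(h_i\) shares probability mass across related words and thereby mitigates n-gram sparseness.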
