Hierarchical Distributed Representations for Statistical Language Modeling

Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty-thousand-word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts.
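
To make the architecture concrete, here is a minimal sketch of the two stages described above. It is not the authors' implementation: it substitutes a plain truncated SVD of smoothed bigram statistics for the paper's linear and nonlinear dimensionality-reduction algorithms, and a flat (single-level) mixture of multinomial experts for the hierarchical mixture. The vocabulary size, representation dimensionality, number of experts, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 1000   # vocabulary size (illustrative; the paper's task uses ~60,000 words)
d = 50     # dimensionality of the context representations (assumed)
K = 8      # number of multinomial experts (assumed)

# Bigram count matrix: C[i, j] = number of times word j follows word i.
# Random counts stand in for statistics gathered from a training corpus.
C = rng.poisson(0.05, size=(V, V)).astype(float)

# Stage 1: initialize distributed representations by linear dimensionality
# reduction. Here, a truncated SVD of the smoothed, row-normalized,
# log-transformed bigram matrix yields a d-dimensional vector per context word.
P = (C + 0.01) / (C + 0.01).sum(axis=1, keepdims=True)  # rows are distributions
U, S, _ = np.linalg.svd(np.log(P), full_matrices=False)
X = U[:, :d] * S[:d]    # X[i] = distributed representation of context word i

# Stage 2: a mixture of multinomial experts. Expert k holds a distribution
# theta[k] over the V predicted words; a gating network maps the context
# representation to mixing weights with a softmax. (The paper's mixture is
# hierarchical; a single gating level is shown here for brevity.)
W = rng.normal(scale=0.01, size=(K, d))    # gating weights (untrained)
theta = rng.dirichlet(np.ones(V), size=K)  # expert multinomials (untrained)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_probs(i):
    """P(w | context word i) = sum_k g_k(x_i) * theta[k, w]."""
    g = softmax(W @ X[i])   # mixing weights for context i, shape (K,)
    return g @ theta        # next-word distribution, shape (V,), sums to 1

p = next_word_probs(0)
print(p.shape, round(p.sum(), 6))   # (1000,) 1.0
```

In the model proper, the gating parameters and expert multinomials would then be fit to the corpus (e.g., by EM, as in hierarchical mixtures of experts [12]); the sketch only shows how the initialized representations and the experts compose into a normalized next-word distribution.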

[1] Peter F. Brown et al. Class-Based n-gram Models of Natural Language. Computational Linguistics, 1992.

[2] Peter F. Brown et al. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 1993.

[3] Michael I. Jordan and Robert A. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 1994.

[4] Michael Collins. Three Generative, Lexicalised Models for Statistical Parsing. ACL, 1997.

[5] Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.

[6] Lawrence Saul and Fernando Pereira. Aggregate and mixed-order Markov models for statistical language processing. EMNLP, 1997.

[7] Thomas Hofmann et al. Mixture models for co-occurrence and histogram data. Proceedings of the Fourteenth International Conference on Pattern Recognition (ICPR), 1998.

[9] Abdo Y. Alfakih et al. Solving Euclidean Distance Matrix Completion Problems via Semidefinite Programming. Computational Optimization and Applications, 1999.

[10] Brian Borchers. CSDP, a C library for semidefinite programming. Optimization Methods and Software, 1999.

[11] Stanley F. Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. ACL, 1996.

[12] Yoshua Bengio et al. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 2003.

[13] Kilian Q. Weinberger et al. Learning a kernel matrix for nonlinear dimensionality reduction. ICML, 2004.

[14] ChengXiang Zhai and John Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 2004.

[15] Jihun Ham et al. A kernel view of the dimensionality reduction of manifolds. ICML, 2004.