Hierarchical Probabilistic Neural Network Language Model

In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provides good generalization even when the number of training examples is insufficient. However, these models are much slower than the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 in both training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by prior knowledge extracted from the WordNet semantic hierarchy.
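A minimal sketch of the core idea (hypothetical names and shapes, not the paper's exact model): instead of normalizing a softmax over the full vocabulary, each word sits at a leaf of a binary tree, and its probability is the product of binary branching decisions along the root-to-leaf path, each decision modeled with a sigmoid of a node-specific score.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hierarchical_word_prob(context_vec, path_nodes, path_bits, node_vecs):
        """P(word | context) as a product of binary decisions along the
        word's path in a binary tree over the vocabulary.

        context_vec : (d,) representation of the conditioning context
        path_nodes  : internal-node indices from the root to the word's leaf
        path_bits   : branch taken at each node (0 = left, 1 = right)
        node_vecs   : (num_internal_nodes, d), one parameter vector per node
        """
        prob = 1.0
        for node, bit in zip(path_nodes, path_bits):
            # Probability of branching right at this node, given the context.
            p_right = sigmoid(node_vecs[node] @ context_vec)
            prob *= p_right if bit == 1 else (1.0 - p_right)
        return prob

For a balanced tree over a vocabulary of size |V|, each prediction touches about log2 |V| nodes (roughly 17 for |V| = 100,000) instead of |V| output units, which is consistent with the roughly two-orders-of-magnitude speed-up reported above.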
