Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Previous work on statistical language modeling has shown that it is possible to train a feedforward neural network to approximate probabilities over sequences of words, resulting in significant error reductions compared to standard baseline models based on n-grams. However, training the neural network with the maximum-likelihood criterion requires, for each training example, computation proportional to the number of words in the vocabulary, because the gradient involves an expectation over the network's output distribution. In this paper, we introduce adaptive importance sampling as a way to accelerate training: the expensive expectation is estimated from a small sample of words, using an adaptive n-gram model as the proposal distribution, updated during training to track the conditional distributions produced by the neural network. We show that a very significant speedup can be obtained on standard problems.
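
To make the acceleration concrete, here is a minimal sketch (not the authors' code) of self-normalized importance sampling applied to the negative part of the log-likelihood gradient. The names (importance_weights, vocab_size) and the uniform stand-in proposal are assumptions introduced for illustration; in the paper the proposal would be the adaptive n-gram model.

    import numpy as np

    def importance_weights(sampled_scores, sampled_q):
        # Self-normalized importance weights approximating P(w'|h) on the sample.
        # r(w') = exp(a_{w'}(h)) / Q(w'|h), normalized over the sample so the
        # partition function Z(h) = sum_w exp(a_w(h)) never has to be computed.
        log_r = sampled_scores - np.log(sampled_q)
        log_r -= log_r.max()          # subtract the max for numerical stability
        r = np.exp(log_r)
        return r / r.sum()

    # Usage sketch with stand-in values; k << |V| is the point of the method.
    rng = np.random.default_rng(0)
    vocab_size, k = 10_000, 25
    q = np.full(vocab_size, 1.0 / vocab_size)    # hypothetical uniform proposal;
                                                 # the paper adapts an n-gram model
    sample = rng.choice(vocab_size, size=k, p=q) # draw k words w' ~ Q(.|h)
    scores = rng.normal(size=vocab_size)         # stand-in for network scores a_{w'}(h)
    weights = importance_weights(scores[sample], q[sample])
    # The full-softmax gradient   grad a_w(h) - sum_{w'} P(w'|h) grad a_{w'}(h)
    # is then approximated by     grad a_w(h) - sum_j weights[j] grad a_{sample[j]}(h),
    # touching only k output units per example instead of the whole vocabulary.

Normalizing the weights over the sample sidesteps the intractable partition function at the cost of a small bias; keeping the proposal close to the network's distribution, which is the role of the adaptive n-gram model, keeps the variance of the estimator, and hence the sample size needed, small.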
