Neural Probabilistic Language Models

A central goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. Generalization is obtained because a sequence of words that has never been seen before receives high probability if it is made of words that are similar (in the sense of having a nearby representation) to the words of an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on several methods to speed up both training and probability computation, as well as comparative experiments to evaluate the improvements brought by these techniques. Finally, we describe the incorporation of this new language model into a state-of-the-art recognizer of conversational speech.
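To make the architecture concrete, the sketch below shows the shape of such a model in Python/NumPy: each context word is looked up in a shared table of learned feature vectors, the concatenated vectors pass through a tanh hidden layer, and a softmax over the whole vocabulary yields the next-word distribution. All sizes (vocabulary, context length, embedding and hidden dimensions) and parameter names are illustrative assumptions, not values from the paper, and the full model's direct word-to-output connections are omitted for brevity.

```python
import numpy as np

# Minimal sketch of a feed-forward neural probabilistic language model:
# each of the (n-1) context words maps to a learned d-dimensional feature
# vector; the concatenated vectors feed a tanh hidden layer whose output
# is turned into a next-word distribution by a softmax over the vocabulary.
# All sizes below are illustrative, not values from the paper.

rng = np.random.default_rng(0)
V, n_context, d, h = 10_000, 3, 60, 50   # |vocab|, context words, dims

C = rng.normal(scale=0.01, size=(V, d))              # word feature vectors
H = rng.normal(scale=0.01, size=(n_context * d, h))  # input -> hidden weights
b = np.zeros(h)                                      # hidden bias
U = rng.normal(scale=0.01, size=(h, V))              # hidden -> output weights
c = np.zeros(V)                                      # output bias

def next_word_probs(context_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for one context window."""
    x = C[context_ids].reshape(-1)   # concatenate the (n-1) word vectors
    a = np.tanh(x @ H + b)           # hidden layer
    s = a @ U + c                    # unnormalized score for every word
    s -= s.max()                     # shift for numerical stability
    e = np.exp(s)
    return e / e.sum()               # softmax over the full vocabulary

p = next_word_probs([12, 7, 391])    # arbitrary example word indices
assert np.isclose(p.sum(), 1.0)
```

Note that the final softmax touches every word in the vocabulary, so it dominates the cost of both training and probability computation; that per-prediction cost is what the speed-up methods mentioned in the abstract target.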
