A Neural Probabilistic Language Model

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model simultaneously learns (1) a distributed representation for each word and (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having nearby representations) to the words of an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models and allows the model to take advantage of longer contexts.
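The architecture the abstract describes can be summarized compactly: each context word is mapped through a shared embedding matrix to its distributed representation, the representations are concatenated, passed through a nonlinear hidden layer, and a softmax output produces the probability of the next word. The sketch below is a minimal illustrative rendering in PyTorch, not the authors' implementation; all sizes (vocabulary, context length, embedding and hidden dimensions) are placeholder assumptions, and the paper's optional direct input-to-output connections are omitted for brevity.

```python
# Minimal sketch of a neural probabilistic language model of the kind the
# abstract describes. Hyperparameters are illustrative, not the paper's values.
import torch
import torch.nn as nn

class NPLM(nn.Module):
    def __init__(self, vocab_size=10000, context_size=3,
                 embed_dim=60, hidden_dim=50):
        super().__init__()
        # Shared matrix C: one learned distributed representation per word.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Hidden layer over the concatenated context representations.
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        # Output layer: one score per vocabulary word.
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):
        # context: (batch, context_size) indices of the preceding words.
        x = self.embed(context).flatten(start_dim=1)  # (batch, ctx*embed_dim)
        h = torch.tanh(self.hidden(x))
        return torch.log_softmax(self.out(h), dim=-1)  # log P(w_t | context)

# Usage: score the next word given a 3-word context (arbitrary indices).
model = NPLM()
logp = model(torch.tensor([[12, 407, 3]]))
# Training maximizes the log-likelihood of the observed next word,
# which jointly updates the embeddings and the probability function.
loss = nn.NLLLoss()(logp, torch.tensor([42]))
```

Because the embedding matrix is shared across all positions and all training sentences, gradient updates for one sentence move the representations of its words, and thereby change the predicted probabilities of every other sentence built from similar words; this is the mechanism behind the "exponential number of semantically neighboring sentences" claim.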
