Quick Training of Probabilistic Neural Nets by Importance Sampling

Our previous work on statistical language modeling introduced the use of probabilistic feedforward neural networks to help deal with the curse of dimensionality. Training this model by maximum likelihood, however, requires as many network passes per training example as there are words in the vocabulary. Inspired by contrastive divergence, we propose and evaluate sampling-based methods that require network passes only for the observed “positive example” and a few sampled negative example words. A very significant speed-up is obtained with adaptive importance sampling.
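To illustrate the idea, the following is a minimal NumPy sketch (not the authors' code) of an importance-sampling estimate of the log-likelihood gradient for a softmax output layer: the expensive sum over the full vocabulary in the gradient is replaced by a self-normalized estimate computed from a handful of words drawn from a proposal distribution. The names (score, is_softmax_gradient) and the fixed unigram proposal are illustrative assumptions; the paper instead adapts the proposal to the model during training (adaptive importance sampling).

    # Hypothetical sketch of importance-sampling gradient estimation for a
    # softmax output layer; a fixed unigram proposal is assumed for simplicity.
    import numpy as np

    rng = np.random.default_rng(0)

    V, d = 10_000, 50                           # vocabulary size, hidden size
    W = rng.normal(scale=0.01, size=(V, d))     # output-layer weights, one row per word
    unigram = np.full(V, 1.0 / V)               # proposal distribution q(w)

    def score(h, w):
        # Unnormalized log-probability (score) of word w given hidden state h.
        return W[w] @ h

    def is_softmax_gradient(h, target, k=25):
        # Estimate d log P(target | h) / dW with k importance samples.
        # Exact gradient: e_target h  -  sum_w P(w | h) e_w h; the full-vocabulary
        # sum is replaced by a self-normalized importance-sampling estimate under q.
        grad = np.zeros_like(W)
        grad[target] += h                            # observed "positive" example

        samples = rng.choice(V, size=k, p=unigram)   # sampled negative example words
        log_w = np.array([score(h, w) for w in samples]) - np.log(unigram[samples])
        r = np.exp(log_w - log_w.max())
        r /= r.sum()                                 # normalized importance weights
        for w_i, r_i in zip(samples, r):
            grad[w_i] -= r_i * h                     # approximate "negative" phase
        return grad

    h = rng.normal(size=d)
    g = is_softmax_gradient(h, target=42)

Only k + 1 score computations are needed per example instead of |V|, which is the source of the speed-up reported in the abstract.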
