Statistical Language Models Based on Neural Networks

Statistical language models are a crucial part of many successful applications, such as automatic speech recognition and statistical machine translation (for example, the well-known Google Translate). Traditional techniques for estimating these models are based on N-gram counts. Despite the known weaknesses of N-grams and the huge effort of research communities across many fields (speech recognition, machine translation, neuroscience, artificial intelligence, natural language processing, data compression, psychology, etc.), N-grams have remained essentially the state of the art. The goal of this thesis is to present various architectures of language models that are based on artificial neural networks. Although these models are computationally more expensive than N-gram models, the presented techniques make it possible to apply them efficiently in state-of-the-art systems. The achieved reductions in the word error rate of speech recognition systems are up to 20% against a state-of-the-art N-gram model. The presented recurrent neural network based model achieves the best published performance on the well-known Penn Treebank setup.

Keywords: language model, neural network, recurrent, maximum entropy, speech recognition, data compression, artificial intelligence
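As a concrete illustration of the count-based estimation that the abstract contrasts with neural models, the following minimal sketch computes maximum-likelihood bigram probabilities from a toy corpus. The function name and corpus are illustrative only; real N-gram models additionally apply smoothing (e.g. Kneser-Ney) so that unseen word pairs do not receive zero probability.

```python
from collections import Counter

def bigram_mle(tokens):
    """Maximum-likelihood bigram estimates: P(w | h) = count(h, w) / count(h)."""
    history = Counter(tokens[:-1])                 # counts of each history word h
    pairs = Counter(zip(tokens[:-1], tokens[1:]))  # counts of each bigram (h, w)
    return {(h, w): c / history[h] for (h, w), c in pairs.items()}

corpus = "the cat sat on the mat".split()
probs = bigram_mle(corpus)
print(probs[("the", "cat")])  # count(the cat) / count(the) = 1 / 2 = 0.5
```

Because the estimates are pure relative frequencies, the probabilities conditioned on any history sum to one, which is the property smoothing methods must preserve while redistributing mass to unseen events.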
