Trends and advances in speech recognition

One of the earliest successful applications of machine-learning techniques to pattern recognition was the use of information-theoretic principles in speech recognition. Earlier approaches relied heavily on expert knowledge, acquired through painstaking analysis of data, to relate speech signals to the word sequences that produced them. Those methodologies were almost entirely displaced once the speech recognition problem was cast in a probabilistic framework that models the joint probability distribution of speech signals and word sequences. Since the beginning of the 21st century, the data and computation available for training models have grown exponentially, and new machine-learning algorithms and methodologies have opened new vistas for approaching complex pattern recognition problems. These advances are enabled by a family of machine-learning techniques known as graphical models, which come with computationally tractable training algorithms. Closely related are neural-network modeling techniques, and there has been a resurgence of interest in applying neural-network concepts, such as deep networks, to speech recognition. The explosion of data has also driven the development of efficient methods for capturing the key features of massive data sets using exemplar-based sparse representations. Finally, all of these approaches can be tied together in a principled fashion through another variation of graphical models: the exponential-model framework. This paper describes the current state of the art in speech recognition systems and highlights the developments that are expected to produce major breakthroughs in our ability to recognize speech automatically using computers.
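
As a brief illustration of the probabilistic framework described above (the standard Bayes decision rule for statistical speech recognition, stated here in common textbook notation rather than quoted from this paper), the recognizer chooses the word sequence $\hat{W}$ that is most probable given the observed acoustic evidence $A$:

\[
\hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} P(A \mid W)\, P(W),
\]

where $P(A \mid W)$ is the acoustic model and $P(W)$ is the language model; the denominator $P(A)$ is constant over candidate word sequences and can be dropped from the maximization.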
