Deep and Wide: Multiple Layers in Automatic Speech Recognition

This paper reviews a line of research, carried out over the last decade, on speech recognition assisted by discriminatively trained feedforward networks. The particular focus is the use of multiple layers of processing preceding hidden Markov model (HMM) based decoding of word sequences. Emphasis is placed on the use of multiple streams of wide, high-dimensional layers, which have proven useful for this purpose. The paper ultimately concludes that while deep processing structures can provide improvements for this genre, the choice of features and the structure with which they are incorporated, including layer width, can also be significant factors.
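As an illustration of the architecture the abstract describes, the following is a minimal sketch (not the paper's actual system) of the "tandem" idea: a discriminatively trained feedforward network maps acoustic frames to phone posteriors, and the log posteriors are then used as features for a conventional HMM decoder. All dimensions, layer sizes, and function names here are hypothetical choices for the example.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_posterior_features(frames, weights, biases):
    """Forward acoustic frames through a feedforward net and return
    log phone posteriors, usable as tandem features for an HMM system."""
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)               # hidden layers
    logits = h @ weights[-1] + biases[-1]    # output layer
    return np.log(softmax(logits) + 1e-10)   # log posteriors as features

# toy setup: 13-dim input frames, two 64-unit hidden layers, 40 phone classes
rng = np.random.default_rng(0)
dims = [13, 64, 64, 40]
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
feats = mlp_posterior_features(rng.standard_normal((100, 13)), weights, biases)
```

In a real tandem system the network is trained on phone targets and the resulting features are typically decorrelated (e.g. via PCA) before being modeled by Gaussian-mixture HMMs; "wide" in the paper's sense refers to large hidden layers and multiple parallel feature streams feeding such a front end.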
