Parts-based models and local features for automatic speech recognition

While automatic speech recognition (ASR) systems have steadily improved and are now in widespread use, their accuracy continues to lag behind human performance, particularly in adverse conditions. This thesis revisits the basic acoustic modeling assumptions common to most ASR systems and argues that improvements to the underlying model of speech are required to address these shortcomings. A number of problems with the standard approach, which combines hidden Markov models (HMMs) with features derived from fixed, frame-based spectra (e.g., MFCCs), are discussed. Based on these problems, a set of desirable properties of an improved acoustic model is proposed, and we present a "parts-based" framework as an alternative. The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles. We discuss the proposed model's relationship to HMMs and segment-based recognizers, and describe how both can be viewed as special cases of the PBM. Two variations of PBMs are described in detail. The first represents each phonetic unit with a set of time-frequency (T-F) "patches" which act as filters over a spectrogram; the model structure encodes the patches' relative T-F positions. The second variation, referred to as a "speech schematic" model, encodes the information in a spectrogram more directly by using simple edge detectors and placing more emphasis on modeling the constraints between parts. We demonstrate the proposed models on various isolated recognition tasks and show their benefits over baseline systems, particularly in noisy conditions and when only limited training data is available. We discuss efficient implementation of the models and describe how they can be combined to build larger recognition systems.
It is argued that the flexible templates used in parts-based modeling may provide a better generative model of speech than typical HMMs. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
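
To make the idea of a deformable template of T-F patches concrete, the following is a minimal illustrative sketch, not the implementation used in the thesis. It assumes a star-shaped template in the spirit of pictorial structures: each part is a small patch filter with an ideal (frequency, time) offset from a common anchor, and moving a part away from its ideal position incurs a quadratic penalty. The function names, the `offset` and `stiffness` fields, the `max_shift` search window, and the toy spectrogram are all hypothetical choices made for this example.

```python
def patch_response(spec, patch, f0, t0):
    """Correlate `patch` with `spec`, patch top-left corner at (f0, t0)."""
    rows, cols = len(patch), len(patch[0])
    if f0 < 0 or t0 < 0 or f0 + rows > len(spec) or t0 + cols > len(spec[0]):
        return float("-inf")  # patch would fall outside the spectrogram
    return sum(patch[i][j] * spec[f0 + i][t0 + j]
               for i in range(rows) for j in range(cols))


def best_placement(spec, part, anchor, max_shift=2):
    """Best (score, position) for one part, given the template anchor.

    Assumed scoring: filter response minus a quadratic deformation penalty
    on the displacement from the part's ideal position.
    """
    ideal_f = anchor[0] + part["offset"][0]
    ideal_t = anchor[1] + part["offset"][1]
    best = (float("-inf"), None)
    for df in range(-max_shift, max_shift + 1):
        for dt in range(-max_shift, max_shift + 1):
            pos = (ideal_f + df, ideal_t + dt)
            penalty = part["stiffness"] * (df * df + dt * dt)
            score = patch_response(spec, part["patch"], *pos) - penalty
            if score > best[0]:
                best = (score, pos)
    return best


def template_score(spec, parts, anchor, max_shift=2):
    """Total score of the template: parts are placed independently
    because, in this sketch, each one depends only on the shared anchor."""
    placements = [best_placement(spec, p, anchor, max_shift) for p in parts]
    return sum(s for s, _ in placements), [pos for _, pos in placements]


# Toy spectrogram (6 frequency bins x 8 frames) with two 2x2 energy blobs.
spec = [[0.0] * 8 for _ in range(6)]
for f, t in [(1, 1), (3, 5)]:
    for i in range(2):
        for j in range(2):
            spec[f + i][t + j] = 1.0

ones = [[1.0, 1.0], [1.0, 1.0]]
parts = [
    {"patch": ones, "offset": (0, 0), "stiffness": 0.5},
    # The second blob sits one frame later than this part's ideal offset.
    {"patch": ones, "offset": (2, 3), "stiffness": 0.5},
]

score, positions = template_score(spec, parts, anchor=(1, 1))
print(score, positions)  # 7.5 [(1, 1), (3, 5)]
```

In this toy case the second part shifts one frame from its ideal position to cover its blob, trading a small deformation penalty for a larger filter response. That trade-off is the sense in which such a template is "deformable", in contrast to matching a fixed spectral profile at a fixed frame.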
