Applications of broad class knowledge for noise robust speech recognition

This thesis introduces a novel technique for noise robust speech recognition by first describing a speech signal through a set of broad speech units, and then conducting a more detailed analysis from these broad classes. These classes are formed by grouping together parts of the acoustic signal that have similar temporal and spectral characteristics, and therefore have much less variability than typical sub-word units used in speech recognition (i.e., phonemes, acoustic units). We explore broad classes formed along phonetic and acoustic dimensions. This thesis first introduces an instantaneous adaptation technique to robustly recognize broad classes in the input signal. Given an initial set of broad class models and input speech data, we explore a gradient steepness metric using the Extended Baum-Welch (EBW) transformations to explain how much these initial model must be adapted to fit the target data. We incorporate this gradient metric into a Hidden Markov Model (HMM) framework for broad class recognition and illustrate that this metric allows for a simple and effective adaptation technique which does not suffer from issues such as data scarcity and computational intensity that affect other adaptation methods such as Maximum a-Posteriori (MAP), Maximum Likelihood Linear Regression (MLLR) and feature-space Maximum Likelihood Linear Regression (fM-LLR). Broad class recognition experiments indicate that the EBW gradient metric method outperforms the standard likelihood technique, both when initial models are adapted via MLLR and without adaptation. Next, we explore utilizing broad class knowledge as a pre-processor for segment-based speech recognition systems, which have been observed to be quite sensitive to noise. The experiments are conducted with the SUMMIT segment-based speech recognizer, which detects landmarks - representing possible transitions between phonemes - from large energy changes in the acoustic signal. These landmarks are often poorly detected in noisy conditions. We investigate using the transitions between broad classes, which typically occur at areas of large acoustic change in the audio signal, to aid in landmark detection. We also explore broad classes motivated along both acoustic and phonetic dimensions. Phonetic recognition experiments indicate that utilizing either phonetically or acoustically motivated broad classes offers significant recognition improvements compared to the baseline landmark method in both stationary and non-stationary noise conditions. Finally, this thesis investigates using broad class knowledge for island-driven search. Reliable regions of a speech signal, known as islands, carry most information in the signal compared to unreliable regions, known as gaps. Most speech recognizers do not differentiate between island and gap regions during search and as a result most of the search computation is spent in unreliable regions. Island-driven search addresses this problem by first identifying islands in the speech signal and directing the search outwards from these islands. In this thesis, we develop a technique to identify islands from broad classes which have been confidently identified from the input signal. We explore a technique to prune the search space given island/gap knowledge. Finally, to further limit the amount of computation in unreliable regions, we investigate scoring less detailed broad class models in gap regions and more detailed phonetic models in island regions. Experiments on both small and large scale vocabulary tasks indicate that the island-driven search strategy results in an improvement in recognition accuracy and computation tune. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

[1]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[2]  John Dillenburg,et al.  Techniques for improving the efficiency of heuristic search , 1993 .

[3]  Andrew K. Halberstadt Heterogeneous acoustic measurements and multiple classifiers for speech recognition , 1999 .

[4]  Victor Zue,et al.  Automatic language identification using a segment-based approach , 1993, EUROSPEECH.

[5]  Mari Ostendorf,et al.  From HMM's to segment models: a unified view of stochastic modeling for speech recognition , 1996, IEEE Trans. Speech Audio Process..

[6]  Renato De Mori,et al.  High performance connected digit recognition using maximum mutual information estimation , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[7]  James R. Glass,et al.  Recent progress in the MIT spoken lecture processing project , 2007, INTERSPEECH.

[8]  Chin-Hui Lee,et al.  Word recognition using whole word and subword models , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[9]  Mark J. F. Gales,et al.  The generation and use of regression class trees for MLLR adaptation , 1996 .

[10]  Gunnar Fant,et al.  Acoustic Theory Of Speech Production , 1960 .

[11]  Tara N. Sainath,et al.  Acoustic landmark detection and segmentation using the McAulay-Quatieri Sinusoidal Model , 2005 .

[12]  Timothy J. Hazen Visual model structures and synchrony constraints for audio-visual speech recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Tara N. Sainath,et al.  Broad phonetic class recognition in a Hidden Markov model framework using extended Baum-Welch transformations , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[14]  Xiaodong Cui,et al.  MMSE-based stereo feature stochastic mapping for noise robust speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15]  Frank K. Soong,et al.  A segment model based approach to speech recognition , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[16]  Lawrence K. Saul,et al.  Comparison of Large Margin Training to Other Discriminative Methods for Phonetic Recognition by Hidden Markov Models , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[17]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[18]  Giorgio Satta,et al.  Computation of Probabilities for an Island-Driven Parser , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[19]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[20]  Min Tang,et al.  Modeling linguistic features in speech recognition , 2003, INTERSPEECH.

[21]  Biing-Hwang Juang,et al.  Speech recognition in adverse environments , 1991 .

[22]  Benoît Maison,et al.  Toward island-of-reliability-driven very-large-vocabulary on-line handwriting recognition using character confidence scoring , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[23]  R Carlson,et al.  Quarterly Progress and Status Report Phonetic and orthographic properties of the basic vocabulary of five European languages , 2007 .

[24]  P. P. Chakrabarti,et al.  Heuristic Search Through Islands , 1986, Artif. Intell..

[25]  Stephen Cox,et al.  Some statistical issues in the comparison of speech recognition algorithms , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[26]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[27]  Dimitri Kanevsky Extended Baum transformations for general functions , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[28]  Julia Hirschberg,et al.  V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure , 2007, EMNLP.

[29]  B. Atal Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. , 1974, The Journal of the Acoustical Society of America.

[30]  James R. Glass,et al.  HETEROGENEOUS ACOUSTIC MEASUREMENTS FOR PHONETIC CLASSIFICATION , 1997 .

[31]  Etienne Barnard,et al.  Analysis of phoneme-based features for language identification , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[32]  Chi-youn Park,et al.  Consonant landmark detection for speech recognition , 2008 .

[33]  Victor Zue,et al.  A model of lexical access from partial phonetic information , 1984, ICASSP.

[34]  D. Cox,et al.  An Analysis of Transformations , 1964 .

[35]  Sherif Abdou,et al.  Beam search pruning in speech recognition using a posterior probability-based confidence measure , 2004, Speech Commun..

[36]  A. House,et al.  Toward automatic identification of the language of an utterance. I. Preliminary methodological con , 1977 .

[37]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[38]  Tara N. Sainath,et al.  Unsupervised Audio Segmentation using Extended Baum-Welch Transformations , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[39]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[40]  Stephanie Seneff,et al.  Lexical stress modeling for improved speech recognition of spontaneous telephone speech in the jupiter domain , 2001, INTERSPEECH.

[41]  Alex Acero,et al.  Spoken Language Processing: A Guide to Theory, Algorithm and System Development , 2001 .

[42]  Richard M. Stern,et al.  Sources of degradation of speech recognition in the telephone network , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[43]  Günther Ruske,et al.  A confidence-guided dynamic pruning approach - utilization of confidence measurement in speech recognition , 2005, INTERSPEECH.

[44]  H. Pick,et al.  Inhibiting the Lombard effect. , 1989, The Journal of the Acoustical Society of America.

[45]  Gernot A. Fink,et al.  Combining acoustic and articulatory feature information for robust speech recognition , 2002, Speech Commun..

[46]  J. Wolf,et al.  The HWIM speech understanding system , 1977 .

[47]  Noam Chomsky,et al.  The Sound Pattern of English , 1968 .

[48]  William J. Byrne,et al.  Noise robustness in the auditory representation of speech signals , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[49]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[50]  Victor Zue,et al.  On organic interfaces , 2007, INTERSPEECH.

[51]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[52]  Tara N. Sainath,et al.  Audio classification using extended baum-welch transformations , 2007, INTERSPEECH.

[53]  Michael Picheny,et al.  Influence of background noise and microphone on the performance of the IBM Tangora speech recognition system , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[54]  Abeer Alwan,et al.  On the use of variable frame rate analysis in speech recognition , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[55]  G. A. Miller,et al.  An Analysis of Perceptual Confusions Among Some English Consonants , 1955 .

[56]  Brian Kingsbury,et al.  Evaluation of Proposed Modifications to MPE for Large Scale Discriminative Training , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[57]  Tara N. Sainath,et al.  Gradient steepness metrics using extended Baum-Welch transformations for universal pattern recognition tasks , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[58]  VargaAndrew,et al.  Assessment for automatic speech recognition II , 1993 .

[59]  Yuqing Gao,et al.  Noise reduction and speech recognition in noise conditions tested on LPNN-based continuous speech recognition system , 1993, EUROSPEECH.

[60]  A. Nadas,et al.  A generalization of the Baum algorithm to rational objective functions , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[61]  Coarticulation • Suprasegmentals,et al.  Acoustic Phonetics , 2019, The SAGE Encyclopedia of Human Communication Sciences and Disorders.

[62]  Tara N. Sainath,et al.  A Sinusoidal Model Approach to Acoustic Landmark Detection and Segmentation for Robust Segment-Based Speech Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[63]  V.W. Zue,et al.  The use of speech knowledge in automatic speech recognition , 1985, Proceedings of the IEEE.

[64]  Brian Kingsbury,et al.  Boosted MMI for model and feature-space discriminative training , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[65]  Daniel P. W. Ellis,et al.  Using Broad Phonetic Group Experts for Improved Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[66]  W. A. Woods,et al.  Language processing for speech understanding , 1986 .

[67]  Timothy J. Hazen,et al.  Pronunciation modeling using a finite-state transducer representation , 2005, Speech Commun..

[68]  Jeff A. Bilmes,et al.  Attention shift decoding for conversational speech recognition , 2007, INTERSPEECH.

[69]  James R. Glass A probabilistic framework for segment-based speech recognition , 2003, Comput. Speech Lang..

[70]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[71]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[72]  Jordan Cohen,et al.  The GALE project: A description and an update , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[73]  Mark J. F. Gales,et al.  Robust continuous speech recognition using parallel model combination , 1996, IEEE Trans. Speech Audio Process..

[74]  Jonathan G. Fiscus,et al.  Tools for the analysis of benchmark speech recognition tests , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[75]  Yifan Gong,et al.  Speech recognition in noisy environments: A survey , 1995, Speech Commun..

[76]  Phil D. Green,et al.  Robust automatic speech recognition with missing and unreliable acoustic data , 2001, Speech Commun..

[77]  Victor Zue,et al.  A* word network search for continuous speech recognition , 1993, EUROSPEECH.

[78]  James R. Glass Finding acoustic regularities in speech: applications to phonetic recognition , 1988 .

[79]  Wayne A. Lea,et al.  Trends in Speech Recognition , 1980 .

[80]  Timothy J. Hazen,et al.  Word and phone level acoustic confidence scoring , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[81]  Stephanie Seneff,et al.  Two-stage continuous speech recognition using feature-based models: a preliminary study , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[82]  Kenneth N Stevens,et al.  Toward a model for lexical access based on acoustic landmarks and distinctive features. , 2002, The Journal of the Acoustical Society of America.

[83]  Tara N. Sainath,et al.  A generalized family of parameter estimation techniques , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[84]  Richard Lippmann,et al.  Speech recognition by machines and humans , 1997, Speech Commun..

[85]  Hakan Erdogan,et al.  Incremental on-line feature space MLLR adaptation for telephony speech recognition , 2002, INTERSPEECH.

[86]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[87]  Mukund Padmanabhan,et al.  Maximum-likelihood nonlinear transformation for acoustic adaptation , 2004, IEEE Transactions on Speech and Audio Processing.

[88]  James R. Glass,et al.  Real-time probabilistic segmentation for segment-based speech recognition , 1998, ICSLP.

[89]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[90]  James R. Glass,et al.  Real-time telephone-based speech recognition in the Jupiter domain , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[91]  James R. Glass,et al.  Segmentation and modeling in segment-based recognition , 1997, EUROSPEECH.

[92]  James R. Glass,et al.  Heterogeneous measurements and multiple classifiers for speech recognition , 1998, ICSLP.

[93]  S. J. Young,et al.  Tree-based state tying for high accuracy acoustic modelling , 1994 .

[94]  Jont B. Allen How do humans process and recognize speech , 1993 .