Discriminant Training of Front-End and Acoustic Modeling Stages to Heterogeneous Acoustic Environmen

Automatic Speech Recognition (ASR) remains an open research problem. In particular, most ASR systems cannot fully handle adverse acoustic environments. Although many modifications have improved robustness, ASR systems still fall short of human recognition ability in a large number of environments. One possible shortcoming of the typical ASR system is its reliance on a single stream of front-end acoustic features and acoustic modeling probabilities. A single front-end feature extraction algorithm may not remain robust across arbitrary acoustic environments, and acoustic modeling also degrades because the acoustic environment changes the feature distributions. This thesis explores the parallel use of multiple front-end and acoustic modeling elements to address this shortcoming. Each ASR acoustic modeling component is trained to estimate class posterior probabilities in a particular acoustic environment. In addition to discriminative training of the probability estimator, existing feature extraction algorithms are modified to improve class discrimination in the training environment. More specifically, Linear Discriminant Analysis provides a mechanism for obtaining discriminant temporal basis functions, which can replace components of the existing algorithms that were designed empirically or intuitively. Probability streams are generated using multiple front-end and acoustic modeling stages trained on heterogeneous acoustic environments. In new sample acoustic environments, simple combinations of these probability streams yield word recognition rates superior to those of the individual streams.
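As an illustration of what a "simple combination" of probability streams might look like, the following Python sketch merges per-frame class posteriors from several streams. The abstract does not specify which combination rules the thesis uses, so the two rules shown here (an arithmetic average and a normalized product, i.e. an average of log posteriors) are assumptions, as are the function name `combine_streams` and the toy posterior values.

```python
import numpy as np


def combine_streams(streams, rule="product"):
    """Combine per-frame class posteriors from several streams.

    streams: list of arrays of shape (n_frames, n_classes); each row is a
             posterior distribution over classes for one frame.
    rule:    "average" for the arithmetic mean of posteriors, or
             "product" for the normalized geometric mean (mean of logs).
    Both rules are illustrative assumptions, not the thesis's stated method.
    """
    stacked = np.stack(streams)  # shape: (n_streams, n_frames, n_classes)
    if rule == "average":
        combined = stacked.mean(axis=0)
    elif rule == "product":
        # Work in the log domain; clip to avoid log(0).
        log_mean = np.log(np.clip(stacked, 1e-12, None)).mean(axis=0)
        # Subtract the per-frame maximum before exponentiating for stability.
        combined = np.exp(log_mean - log_mean.max(axis=1, keepdims=True))
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    # Renormalize so each frame's combined scores sum to one.
    return combined / combined.sum(axis=1, keepdims=True)


# Toy posteriors for one frame and three classes (hypothetical values),
# standing in for estimators trained on two different environments.
stream_a = np.array([[0.7, 0.2, 0.1]])
stream_b = np.array([[0.4, 0.5, 0.1]])

avg = combine_streams([stream_a, stream_b], rule="average")
prod = combine_streams([stream_a, stream_b], rule="product")
```

In this toy frame the two streams disagree on the most likely class, and both combination rules settle on the class that the more confident stream prefers. A real system would pass the combined posteriors on to a decoder rather than taking a per-frame maximum.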
