Heterogeneous acoustic measurements and multiple classifiers for speech recognition

The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, fixed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which we refer to as heterogeneous measurements. This investigation has three principal goals. The first goal is to develop heterogeneous acoustic measurements to increase the amount of acoustic-phonetic information extracted from the speech signal. Diverse measurements are obtained by varying the time-frequency resolution, the spectral representation, the choice of temporal basis vectors, and other aspects of the preprocessing of the speech waveform. The second goal is to develop classifier systems for successfully utilizing high-dimensional heterogeneous acoustic measurement spaces. This is accomplished through hierarchical and committee-based techniques for combining multiple classifiers. The third goal is to increase understanding of the weaknesses of current automatic phonetic classification systems. This is accomplished through perceptual experiments on stop consonants which facilitate comparisons between humans and machines. Systems using heterogeneous measurements and multiple classifiers were evaluated in phonetic classification, phonetic recognition, and word recognition tasks. On the TIMIT core test set, these systems achieved error rates of 18.3% and 24.4% for context-independent phonetic classification and context-dependent phonetic recognition, respectively. These results are the best that we have seen reported on these tasks.
Word recognition experiments using the corpus associated with the JUPITER telephone-based weather information system showed 10–16% word error rate reduction, thus demonstrating that these techniques generalize to word recognition in a telephone-bandwidth acoustic environment. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
