Modelling the effects of speech rate variation for automatic speech recognition

In automatic speech recognition it is widely observed that variations in speech rate severely degrade recognition performance. This is because standard stochastic speech recognition systems specialise on the average speech rate. Although many approaches to modelling speech rate variation have been proposed, an integrated approach in a substantial system has yet to be developed. Common approaches to rate modelling are based on rate dependent models, each trained on a rate specific subset of the training data. During decoding, a signal based rate estimate is computed and used to select the set of rate dependent models. While such approaches can reduce the word error rate significantly, they suffer from shortcomings such as the reduced amount of training data per model and expensive training and decoding procedures. Phonetic investigations, however, show that there is a systematic relationship between speech rate and the acoustic characteristics of speech. Fast speech shows a tendency towards reduction, which can be described in more detail as a centralisation effect and an increase in coarticulation. Centralisation means that the formant frequencies of vowels tend to shift towards the centre of the vowel space, while increased coarticulation denotes the tendency of the spectral features of a vowel to shift towards those of its phonemic neighbour. The goal of this work is to investigate whether knowledge of the systematic influence of speech rate variation on the acoustic features can be incorporated into speech rate modelling. An acoustic-phonetic analysis of a large corpus of spontaneous speech showed that both effects, centralisation and coarticulation, are more pronounced in fast speech. Several measures for these effects were developed and used in speech recognition experiments with rate dependent models.
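As an illustration of what a centralisation measure could look like, the following minimal sketch computes the mean distance of vowel tokens from the centre of the vowel space in the F1/F2 plane. It is an assumption made for illustration, not one of the measures developed in this work: the function names, the use of the centroid as the vowel space centre, and the formant values are all hypothetical.

```python
from math import hypot
from statistics import mean


def vowel_space_centre(formants):
    """Centroid of (F1, F2) points in Hz, used here as the vowel space centre."""
    return (mean(f1 for f1, _ in formants), mean(f2 for _, f2 in formants))


def centralisation_distance(formants, centre):
    """Mean Euclidean distance in Hz of the (F1, F2) points from the centre.

    Smaller values indicate stronger centralisation.
    """
    return mean(hypot(f1 - centre[0], f2 - centre[1]) for f1, f2 in formants)


def shrink_towards(formants, centre, factor):
    """Move every point towards the centre; factor 1.0 leaves it unchanged."""
    c1, c2 = centre
    return [(c1 + factor * (f1 - c1), c2 + factor * (f2 - c2)) for f1, f2 in formants]


# Made-up formant values for three vowel tokens, not measurements.
careful = [(300.0, 2300.0), (700.0, 1300.0), (300.0, 800.0)]
centre = vowel_space_centre(careful)
# Simulated reduction: every token moves halfway towards the centre.
reduced = shrink_towards(careful, centre, 0.5)

print(centralisation_distance(careful, centre))
print(centralisation_distance(reduced, centre))
```

In this toy setting the distance for the simulated reduced tokens is half that of the original tokens. A measure of this kind could in principle serve to sort utterances into rate classes, under the assumptions stated above.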
A thorough investigation of rate dependent models showed that duration based and coarticulation based measures yield significant performance improvements. Depending on the measure used, the models were adapted either to centralisation or to coarticulation. Further experiments showed that more detailed modelling with more rate classes yields a further improvement. It was also observed that a general basis for the models is needed before rate adaptation can be performed. A comparison with other sources of acoustic variation showed that the effects of speech rate are as severe as those of speaker variation and environmental noise. Taken together, these results show that a more substantial system that models rate variation accurately must address both durational and spectral effects. The systematic nature of the effects indicates that continuous modelling is possible.
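The selection of rate dependent models by rate class can be sketched as follows. This is a minimal illustration, assuming that a scalar rate estimate is already available and that classes are separated by fixed boundaries; the function names, the boundary values, and the string placeholders standing in for model sets are hypothetical and not taken from the experiments described here.

```python
from bisect import bisect_right


def rate_class(rate, boundaries):
    """Index of the rate class for a rate estimate.

    `boundaries` must be sorted in ascending order; n boundaries define
    n + 1 classes, and a rate equal to a boundary falls into the upper class.
    """
    return bisect_right(boundaries, rate)


def select_models(rate, boundaries, model_sets):
    """Pick the rate dependent model set that matches the estimated rate."""
    if len(model_sets) != len(boundaries) + 1:
        raise ValueError("need exactly one model set per rate class")
    return model_sets[rate_class(rate, boundaries)]


# Hypothetical boundaries (for example in phones per second) and placeholder
# names standing in for trained model sets.
boundaries = [10.0, 14.0]
model_sets = ["slow_models", "medium_models", "fast_models"]

print(select_models(12.3, boundaries, model_sets))
```

Adding rate classes in this scheme only means adding boundaries and model sets, which also shows the cost noted above: every additional class leaves less training data per model.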
