Robust Speech Rate Estimation for Spontaneous Speech

In this paper, we propose a direct method for estimating speech rate from acoustic features, without requiring any automatic speech transcription. We compare various spectral and temporal signal analysis and smoothing strategies to better characterize the underlying syllable structure from which speech rate is derived. The proposed algorithm extends spectral subband correlation methods by including temporal correlation and by using prominent spectral subbands to improve the signal correlation essential for syllable detection. Furthermore, to address some of the practical robustness issues in previously proposed methods, we introduce several novel components into the algorithm: pitch confidence for filtering spurious syllable envelope peaks, a magnifying window for tackling smearing between neighboring syllables, and relative peak measure thresholds for rejecting pseudo peaks. We also describe an automated approach for learning algorithm parameters from data, and find the optimal settings through Monte Carlo simulations and parameter sensitivity analysis. Final experimental evaluations are conducted on a portion of the Switchboard corpus for which manual phonetic segmentation information and published results for direct comparison are available. The results show a correlation coefficient of 0.745 with respect to the ground truth based on manual segmentation. This is about a 17% improvement over the current best single estimator and an 11% improvement over the multiestimator evaluated on the same Switchboard database.
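
To make the idea of a relative peak threshold concrete, the following is a minimal sketch, not the algorithm evaluated in the paper. It assumes a syllable envelope has already been computed (for example from subband and temporal correlation) and counts only those local maxima that rise above the preceding valley by a fraction of the envelope maximum. The function names, the `rel_threshold` value of 0.2, and the valley-tracking rule are all hypothetical choices made for illustration.

```python
import numpy as np


def count_syllable_peaks(envelope, rel_threshold=0.2):
    """Count envelope peaks that pass a relative-rise test.

    A local maximum is accepted only if it exceeds the lowest value seen
    since the previously accepted peak by at least rel_threshold * max(envelope).
    Both the rule and the default threshold are illustrative assumptions.
    """
    env = np.asarray(envelope, dtype=float)
    if env.size < 3 or env.max() <= 0:
        return 0

    min_rise = rel_threshold * env.max()
    count = 0
    valley = env[0]  # lowest value since the last accepted peak

    for i in range(1, env.size - 1):
        valley = min(valley, env[i])
        is_local_max = env[i] > env[i - 1] and env[i] >= env[i + 1]
        if is_local_max and env[i] - valley >= min_rise:
            count += 1
            valley = env[i]  # restart valley tracking after this peak
    return count


def speech_rate(envelope, frame_rate_hz, rel_threshold=0.2):
    """Estimate syllables per second from an envelope sampled at frame_rate_hz."""
    duration_s = len(envelope) / frame_rate_hz
    return count_syllable_peaks(envelope, rel_threshold) / duration_s


if __name__ == "__main__":
    # Toy envelope: three clear peaks plus one small ripple (0.05) to reject.
    toy = [0, 1, 0, 0.05, 0, 1, 0, 1, 0]
    print(count_syllable_peaks(toy))
    print(speech_rate(toy, frame_rate_hz=3.0))
```

In this toy envelope the small ripple fails the relative-rise test, so only the three large peaks are counted. A real system would also need the smoothing, pitch-confidence filtering, and windowing steps described above, which this sketch omits.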
