Estimating speaking rate in spontaneous discourse

In this paper we consider the problem of estimating speaking rate directly from the speech waveform. We propose an algorithm that poses speaking rate estimation as a convex optimization problem. In contrast to existing methods, it avoids the more difficult task of detecting individual syllables within the speech signal, and it avoids heuristics such as thresholding a loudness function. We evaluate the algorithm on the ICSI Switchboard spontaneous speech corpus and on a speech corpus obtained from publicly available interviews on YouTube.
