Learning speech rate in speech recognition

A significant performance reduction is often observed in speech recognition when the rate of speech (ROS) is too low or too high. Most of present approaches to addressing the ROS variation focus on the change of speech signals in dynamic properties caused by ROS, and accordingly modify the dynamic model, e.g., the transition probabilities of the hidden Markov model (HMM). However, an abnormal ROS changes not only the dynamic but also the static property of speech signals, and thus can not be compensated for purely by modifying the dynamic model. This paper proposes an ROS learning approach based on deep neural networks (DNN), which involves an ROS feature as the input of the DNN model and so the spectrum distortion caused by ROS can be learned and compensated for. The experimental results show that this approach can deliver better performance for too slow and too fast utterances, demonstrating our conjecture that ROS impacts both the dynamic and the static property of speech. In addition, the proposed approach can be combined with the conventional HMM transition adaptation method, offering additional performance gains.

[1]  Yinghao Li,et al.  Effect of speech rate on inter-segmental coarticulation in Standard Chinese , 2010, 2010 7th International Symposium on Chinese Spoken Language Processing.

[2]  Kevin Barraclough,et al.  I and i , 2001, BMJ : British Medical Journal.

[3]  Daniel Tapias Merino,et al.  Towards speech rate independence in large vocabulary continuous speech recognition , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[4]  Daniel Povey,et al.  Speaking rate adaptation using continuous frame rate normalization , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Benoît Favre,et al.  Speaker adaptation of DNN-based ASR with i-vectors: does it actually adapt models to speakers? , 2014, INTERSPEECH.

[6]  Richard M. Stern,et al.  On the effects of speech rate in large vocabulary speech recognition systems , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[7]  L. Shastri,et al.  SYLLABLE DETECTION AND SEGMENTATION USING TEMPORAL FLOW NEURAL NETWORKS , 1999 .

[8]  James R. Glass,et al.  Speech rhythm guided syllable nuclei detection , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Themos Stafylakis,et al.  I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Shrikanth S. Narayanan,et al.  Robust Speech Rate Estimation for Spontaneous Speech , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Nivja H. Jong,et al.  Praat script to detect syllable nuclei and measure speech rate automatically , 2009, Behavior research methods.

[12]  Jean-Pierre Martens,et al.  A fast and reliable rate of speech detector , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[13]  Eric Fosler-Lussier,et al.  Fast speakers in large vocabulary continuous speech recognition: analysis & antidotes , 1995, EUROSPEECH.

[14]  Dong Yu,et al.  Automatic Speech Recognition: A Deep Learning Approach , 2014 .

[15]  Florian Schiel,et al.  Estimating Speaking Rate by Means of Rhythmicity Parameters , 2011, INTERSPEECH.

[16]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[17]  Mineichi Kudo,et al.  Speech rate change detection in martingale framework , 2012, 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA).

[18]  Eric Fosler-Lussier,et al.  Speech recognition using on-line estimation of speaking rate , 1997, EUROSPEECH.