Phone duration modeling using clustering of rich contexts

This paper describes a phone duration model applied to speech recognition. The model is based on a decision tree that finds clusters of phones in various contexts that tend to have similar durations. Wide contexts with rich linguistic and phonetic features are used. To better model varying and non-stationary speaking rates, the contextual features also include the observed duration values of previous phones. For each resulting phone cluster, a log-normal distribution of duration is estimated. The resulting decision tree and the log-normal distributions are used to calculate likelihoods of phone durations in N-best lists. Experiments on two Estonian recognition tasks show a small but significant improvement in speech recognition accuracy.
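The scoring pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the trained decision tree is replaced by a hand-written stand-in (`leaf_of`) that splits on phone class and on the previous phone's observed duration, durations are assumed to be in frames, and the function names, the toy data, and the `weight` used to combine the duration score with a recognizer score are all hypothetical.

```python
import math
from collections import defaultdict

VOWELS = {"a", "e", "i", "o", "u"}

def leaf_of(phone, prev_duration):
    """Hypothetical stand-in for a trained decision tree: maps a phone and
    the observed duration of the previous phone to a cluster (leaf) id."""
    kind = "vowel" if phone in VOWELS else "cons"
    rate = "fast" if prev_duration < 8 else "slow"
    return f"{kind}/{rate}"

def fit_lognormal(durations):
    """Estimate (mu, sigma) of log-duration by maximum likelihood."""
    logs = [math.log(d) for d in durations]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    # Floor sigma so that tiny clusters do not yield a degenerate density.
    return mu, max(math.sqrt(var), 1e-3)

def lognormal_logpdf(d, mu, sigma):
    """Log density of a log-normal distribution at duration d > 0."""
    z = (math.log(d) - mu) / sigma
    return -math.log(d * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z

def train(samples):
    """samples: iterable of (phone, prev_duration, duration) triples."""
    by_leaf = defaultdict(list)
    for phone, prev, dur in samples:
        by_leaf[leaf_of(phone, prev)].append(dur)
    return {leaf: fit_lognormal(ds) for leaf, ds in by_leaf.items()}

def duration_loglik(hyp, model):
    """Sum of per-phone duration log-likelihoods for one hypothesis."""
    return sum(
        lognormal_logpdf(dur, *model[leaf_of(phone, prev)])
        for phone, prev, dur in hyp
    )

def rescore(nbest, model, weight=0.1):
    """nbest: list of (recognizer_score, hyp). Returns the entry with the
    best combined score; `weight` is an assumed interpolation factor."""
    return max(nbest, key=lambda e: e[0] + weight * duration_loglik(e[1], model))

# Toy training data: (phone, previous phone duration, duration), in frames.
samples = [
    ("a", 10, 12), ("a", 12, 14), ("a", 5, 7), ("a", 6, 6),
    ("k", 10, 8), ("k", 11, 9), ("k", 5, 4), ("k", 6, 5),
]
model = train(samples)

plausible = [("k", 10, 8), ("a", 10, 13)]
implausible = [("k", 10, 30), ("a", 10, 2)]
best = rescore([(-100.0, implausible), (-101.0, plausible)], model)
```

In this toy N-best list the hypothesis with implausible durations has the slightly better recognizer score, but the duration term moves the other hypothesis to the top. A real system would learn the tree from data over much richer contexts and tune the weight on held-out data.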
