Neural network phone duration model for speech recognition

In this paper, we describe a novel phone duration model that improves the accuracy of a large-vocabulary speech recognition system based on state-of-the-art speaker-adapted DNN acoustic models. The model uses a neural network to estimate the probability density function of a phone's duration from the phone's contextual features, and is then applied to word lattice rescoring. Experimental results are given for Estonian, English and Finnish transcription tasks. An absolute word error rate reduction of 0.8-1.4% is observed across all evaluation sets.
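
As a rough illustration of how a duration density can enter rescoring, the sketch below adds a weighted duration log-likelihood to each hypothesis score. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say which distribution family is used, so a log-normal density is assumed here; an N-best list stands in for a word lattice; and the names `lognormal_logpdf`, `rescore`, and `duration_weight` are hypothetical. The `mu` and `sigma` values stand in for what a neural network would predict from a phone's contextual features.

```python
import math


def lognormal_logpdf(duration_frames, mu, sigma):
    """Log density of a log-normal distribution at duration_frames (> 0).

    Assumption: the log-normal family is chosen for illustration only.
    """
    x = float(duration_frames)
    z = (math.log(x) - mu) / sigma
    return -math.log(x * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def rescore(hypotheses, duration_weight=0.1):
    """Return (combined_score, words) pairs, best first.

    Each hypothesis is a dict with:
      "words":  the word sequence,
      "score":  the original log score (acoustic + language model),
      "phones": list of (duration_frames, mu, sigma) tuples, where mu and
                sigma are placeholders for the duration model's outputs.
    """
    rescored = []
    for hyp in hypotheses:
        # Sum the duration log-likelihoods over all phones in the hypothesis.
        dur_logprob = sum(
            lognormal_logpdf(d, mu, sigma) for d, mu, sigma in hyp["phones"]
        )
        combined = hyp["score"] + duration_weight * dur_logprob
        rescored.append((combined, hyp["words"]))
    return sorted(rescored, key=lambda item: item[0], reverse=True)


# Hypothetical two-hypothesis example: "a" has the better original score
# but contains a phone far longer than the predicted typical duration.
example = [
    {"words": "a", "score": -10.0, "phones": [(30, math.log(8), 0.5)]},
    {"words": "b", "score": -10.2, "phones": [(8, math.log(8), 0.5)]},
]
best_words = rescore(example)[0][1]
```

In this toy example the duration term penalizes the hypothesis with the implausibly long phone, so the ranking changes after rescoring. In practice the weight on the duration score would need to be tuned on held-out data.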
