Paper information - Neural network phone duration model for speech recognition

Neural network phone duration model for speech recognition

In this paper, we describe a novel phone duration model that is used to improve the accuracy of a large vocabulary speech recognition system based on state-of-the-art speaker-adapted DNN acoustic models. The duration model uses a neural network to estimate the probability density function of a phone's duration from the phone's contextual features, and the resulting model is applied to word lattice rescoring. Experimental results are given for Estonian, English and Finnish transcription tasks. An absolute word error rate reduction of 0.8-1.4% is observed across all evaluation sets.
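The rescoring idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the network is taken to output the parameters (mu, sigma) of a log-normal duration density, an n-best list stands in for a full word lattice, and every name here (`predict_params`, `duration_weight`, the hypothesis dictionary layout) is hypothetical.

```python
import math


def lognormal_logpdf(duration, mu, sigma):
    """Log-density of a log-normal distribution at `duration` (must be > 0)."""
    z = (math.log(duration) - mu) / sigma
    return -math.log(duration * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def duration_logprob(phones, predict_params):
    """Sum the per-phone duration log-densities of one hypothesis.

    `phones` is a list of (context_features, duration_in_frames) pairs.
    `predict_params` is a stand-in for the neural network (an assumption):
    it maps context features to (mu, sigma) of a log-normal density.
    """
    total = 0.0
    for features, duration in phones:
        mu, sigma = predict_params(features)
        total += lognormal_logpdf(duration, mu, sigma)
    return total


def rescore(hypotheses, predict_params, duration_weight=0.1):
    """Add the weighted duration score to each hypothesis; best first."""
    scored = []
    for hyp in hypotheses:
        dur = duration_logprob(hyp["phones"], predict_params)
        scored.append((hyp["score"] + duration_weight * dur, hyp["words"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy predictor: every phone is expected to last about 8 frames.
def toy_predictor(features):
    return math.log(8.0), 0.5


# Two hypotheses with equal acoustic/language-model scores (invented values).
hypotheses = [
    {"words": "long phone", "score": -100.0, "phones": [({"phone": "a"}, 40)]},
    {"words": "typical phone", "score": -100.0, "phones": [({"phone": "a"}, 8)]},
]
ranked = rescore(hypotheses, toy_predictor)
```

With equal base scores, the hypothesis whose phone durations lie closer to the predicted density ranks first. In a real system the duration weight would be tuned on held-out data, and the scores would be applied to lattice arcs rather than to a flat list.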

Tanel Alumäe
