Using of heterogeneous corpora for training of an ASR system

The paper summarizes the development of the LVCSR (large-vocabulary continuous speech recognition) system built as part of the Pashto speech-translation system at the SCALE (Summer Camp for Applied Language Exploration) 2015 workshop on "Speech-to-text-translation for low-resource languages". Pashto was chosen as a good "proxy" low-resource language because it exhibits multiple phenomena that make developing speech-recognition and speech-to-text-translation systems hard. Even when the amount of data seems sufficient, the fact that it originates from multiple sources means that, in preliminary experiments, simply merging (concatenating) the corpora brings little to no benefit, so more elaborate ways of making use of all the data must be worked out. This paper concentrates only on the LVCSR part and presents a range of techniques that were found useful for benefiting from multiple different corpora.