Training Context-Dependent DNN Acoustic Models Using Probabilistic Sampling

In current HMM/DNN speech recognition systems, the role of the DNN component is to estimate the posterior probabilities of tied triphone states. In most cases the distribution of these states is uneven: the number of training samples varies markedly from state to state. This class imbalance is a source of suboptimality for most machine learning algorithms, and DNNs are no exception. A straightforward remedy is to re-sample the data, either by upsampling the rarer classes or by downsampling the more frequent ones. Here, we experiment with the so-called probabilistic sampling method, which applies downsampling and upsampling at the same time. To this end, it defines a new class distribution for the training data as a linear combination of the original and the uniform class distributions. As an extension of previous studies, we propose a new method for re-estimating the class priors, which is required to remedy the mismatch between the training and test data distributions introduced by re-sampling. Using probabilistic sampling and the proposed modification, we report relative error rate reductions of 5% on the TED-LIUM corpus and 6% on the AMI corpus.
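To make the sampling scheme concrete, below is a minimal Python/NumPy sketch of the idea described above. The function names, the interpolation weight `lam`, and the choice of dividing the DNN posteriors by the interpolated sampling distribution are illustrative assumptions, not the paper's exact recipe; in particular, the paper proposes its own prior re-estimation method, which this sketch does not reproduce.

```python
import numpy as np


def probabilistic_sampling_distribution(class_counts, lam):
    """Interpolate between the empirical class distribution and a uniform one.

    lam = 0 reproduces the original (imbalanced) distribution,
    lam = 1 samples every class equally often.
    """
    counts = np.asarray(class_counts, dtype=np.float64)
    empirical = counts / counts.sum()                  # original class priors
    uniform = np.full_like(empirical, 1.0 / len(empirical))
    return lam * uniform + (1.0 - lam) * empirical


def sample_training_index(sampling_dist, indices_by_class, rng):
    """Draw one training example: pick a class from the interpolated
    distribution, then pick a frame of that class uniformly at random."""
    c = rng.choice(len(sampling_dist), p=sampling_dist)
    return rng.choice(indices_by_class[c])


def posteriors_to_scaled_likelihoods(posteriors, priors):
    """Standard hybrid-HMM step: divide the DNN posteriors by the class
    priors to obtain scaled likelihoods for decoding. After re-sampling,
    the priors must be re-estimated; plugging in the interpolated sampling
    distribution here is one simple choice, not necessarily the paper's."""
    return posteriors / priors


# Toy usage: three tied states with counts 50_000, 1_200 and 300.
rng = np.random.default_rng(0)
dist = probabilistic_sampling_distribution([50_000, 1_200, 300], lam=0.5)
frame = sample_training_index(dist, {0: [0, 1, 2], 1: [3], 2: [4]}, rng)
```

With `lam = 0` the sampler degenerates to plain shuffling of the original frames, so the interpolation weight directly controls how aggressively the rarer triphone states are upsampled relative to the common ones.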
