论文信息 - Improvements to speaker adaptive training of deep neural networks

Improvements to speaker adaptive training of deep neural networks

Speaker adaptive training (SAT) is a well studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct detailed analysis to investigate i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from video signals. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.

[1] Juergen Luettin,et al. Audio-Visual Speech Modeling for Continuous Speech Recognition , 2000, IEEE Trans. Multim..

[2] Chalapathy Neti,et al. Asynchrony modeling for audio-visual speech recognition , 2002 .

[3] Jonathan Le Roux,et al. Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[4] Nicholas W. D. Evans,et al. ALIZE/spkdet: a state-of-the-art open source software for speaker recognition , 2008, Odyssey.

[5] Patrick Kenny,et al. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification , 2009, INTERSPEECH.

[6] Khe Chai Sim,et al. Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems , 2010, INTERSPEECH.

[7] Pascal Vincent,et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[8] Lukás Burget,et al. iVector-based discriminative adaptation for automatic speech recognition , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[9] Daniel Povey,et al. The Kaldi Speech Recognition Toolkit , 2011 .

[10] Dong Yu,et al. Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[11] Kaisheng Yao,et al. Adaptation of context-dependent deep neural networks for automatic speech recognition , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[12] Dong Yu,et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[13] Keikichi Hirose,et al. Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[14] Gerald Penn,et al. Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Navdeep Jaitly,et al. Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition , 2012, INTERSPEECH.

[16] Jinyu Li,et al. Hermitian based Hidden Activation Functions for Adaptation of Hybrid HMM/ANN Models , 2012, INTERSPEECH.

[17] Yifan Gong,et al. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18] Hank Liao,et al. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[19] Florian Metze,et al. Extracting deep bottleneck features using stacked auto-encoders , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20] Tara N. Sainath,et al. Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21] Jan Cernocký,et al. Improved feature processing for deep neural networks , 2013, INTERSPEECH.

[22] Tara N. Sainath,et al. Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[23] Hui Jiang,et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24] Wonkyum Lee,et al. Modular combination of deep neural networks for acoustic modeling , 2013, INTERSPEECH.

[25] George Saon,et al. Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[26] Florian Metze,et al. Deep maxout networks for low-resource speech recognition , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[27] Yajie Miao,et al. Kaldi+PDNN: Building DNN-based ASR Systems with Kaldi and PDNN , 2014, ArXiv.

[28] Florian Metze,et al. Towards speaker adaptive training of deep neural network acoustic models , 2014, INTERSPEECH.

[29] Florian Metze,et al. Distributed learning of multilingual DNN feature extractors using GPUs , 2014, INTERSPEECH.

[30] Florian Metze,et al. Improving language-universal feature extraction with deep maxout and convolutional neural networks , 2014, INTERSPEECH.