An Initial Attempt at Task-Specific Adaptation for Deep Neural Network-based Large Vocabulary Continuous Speech Recognition

In state-of-the-art automatic speech recognition (ASR) systems, adaptation techniques are used to mitigate the performance degradation caused by the mismatch between training and testing conditions. Although many adaptation techniques exist for Gaussian mixture model-hidden Markov model (GMM-HMM)-based systems [3], little work has addressed adaptation in hybrid artificial neural network (ANN)/HMM-based systems [7][8]. Recently, the ANN/HMM scheme for ASR has seen a resurgence with the success of the context-dependent deep neural network HMM (CD-DNN/HMM). In this paper, we therefore present our initial efforts on adaptation techniques for the CD-DNN/HMM system. Specifically, a linear input network (LIN)-based method and a neural network retraining (NNR)-based method are experimentally explored for the task-adaptation purpose. Experiments on a conversational telephone speech data set show that these techniques can improve the system significantly, and the LIN-based method appears to work better with a medium amount of adaptation data.
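The core of the LIN-based approach is to prepend a square linear layer, initialized to the identity, in front of a pre-trained network and to update only that layer on the adaptation data while the original network weights stay frozen. The following is a minimal numpy sketch of that idea, not the paper's implementation: the toy two-layer "base DNN", the dimensions, the simulated input distortion, and the numerical-gradient training loop are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "base DNN": one ReLU hidden layer plus softmax,
# standing in for a pre-trained CD-DNN acoustic model.
D_in, D_hid, D_out = 8, 16, 4
W1 = rng.normal(scale=0.3, size=(D_in, D_hid))
W2 = rng.normal(scale=0.3, size=(D_hid, D_out))

def base_dnn(x):
    h = np.maximum(0.0, x @ W1)                  # frozen hidden layer
    z = h @ W2
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)      # softmax posteriors

# LIN: square input transform, initialized to identity so the adapted
# system starts out exactly equivalent to the unadapted one.
W_lin = np.eye(D_in)

def adapted_dnn(x):
    return base_dnn(x @ W_lin)

# Toy adaptation data: a systematic linear distortion of the inputs,
# with one-hot targets taken from the base model on clean inputs.
X_clean = rng.normal(size=(64, D_in))
distort = np.eye(D_in) + 0.2 * rng.normal(size=(D_in, D_in))
X_adapt = X_clean @ distort
Y = np.eye(D_out)[base_dnn(X_clean).argmax(axis=1)]

def xent(p, y):
    return -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))

# Train only W_lin by central-difference gradient descent; every
# base-DNN weight stays frozen throughout.
lr, eps = 0.1, 1e-5
loss0 = xent(adapted_dnn(X_adapt), Y)
for _ in range(30):
    grad = np.zeros_like(W_lin)
    for i in range(D_in):
        for j in range(D_in):
            W_lin[i, j] += eps
            lp = xent(adapted_dnn(X_adapt), Y)
            W_lin[i, j] -= 2 * eps
            lm = xent(adapted_dnn(X_adapt), Y)
            W_lin[i, j] += eps
            grad[i, j] = (lp - lm) / (2 * eps)
    W_lin -= lr * grad
loss1 = xent(adapted_dnn(X_adapt), Y)
print(f"adaptation loss before: {loss0:.4f}, after: {loss1:.4f}")
```

Because the LIN adds only D_in x D_in parameters, it can be estimated from modest amounts of adaptation data without overfitting, which is consistent with the observation above that it works well with a medium amount of data; NNR-style retraining instead updates all network weights, typically with a small learning rate.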

[1] Tony R. Martinez et al., The need for small learning rates on large problems, 2001, IJCNN'01: International Joint Conference on Neural Networks.

[2] Dong Yu et al., Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription, 2011, IEEE Workshop on Automatic Speech Recognition & Understanding.

[3] Dong Yu et al., Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP], 2011, IEEE Signal Processing Magazine.

[4] Chin-Hui Lee et al., Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Trans. Speech Audio Process.

[5] Hervé Bourlard et al., Connectionist probability estimators in HMM speech recognition, 1994, IEEE Trans. Speech Audio Process.

[6] Ciro Martins et al., Speaker-adaptation in a hybrid HMM-MLP recognizer, 1996, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7] Dong Yu et al., Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, 2012, ICML.

[8] Lawrence R. Rabiner et al., A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[9] Dong Yu et al., Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[10] Hervé Bourlard et al., Connectionist Speech Recognition: A Hybrid Approach, 1993.

[11] Steve Renals et al., The 1995 ABBOT LVCSR system for multiple unknown microphones, 1996, Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96).

[12] N. Morgan et al., A training algorithm for statistical sequence recognition with applications to transition-based speech recognition, 1996, IEEE Signal Processing Letters.

[13] Victor Abrash et al., Mixture input transformations for adaptation of hybrid connectionist speech recognizers, 1997, EUROSPEECH.

[14] Hervé Bourlard et al., CDNN: a context dependent neural network for continuous speech recognition, 1992, ICASSP-92: IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15] Philip C. Woodland et al., Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, 1995, Comput. Speech Lang.

[16] Geoffrey E. Hinton et al., Acoustic Modeling Using Deep Belief Networks, 2012, IEEE Transactions on Audio, Speech, and Language Processing.