On context-dependent neural networks and speaker adaptation

This paper evaluates a neural-network-based hybrid large vocabulary continuous speech recognition (LVCSR) system. The novelty of the evaluated system lies in the speaker adaptation techniques used to improve the performance of neural networks that model context-dependent phonetic units. The comparison proceeds in two steps. First, hybrid systems employing a context-independent neural network are compared with hybrid systems employing a context-dependent neural network. Second, the influence of a recently published speaker adaptation technique called MELT is evaluated. In addition, the paper describes and discusses several approaches to converting posterior probabilities into observation likelihoods, a conversion that every hybrid LVCSR system requires.
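The abstract does not say which conversion approaches the paper covers, so the following is only a minimal sketch of one widely used option, assumed here for illustration: dividing each state posterior by the state prior. By Bayes' rule p(x|s) = P(s|x) p(x) / P(s), and p(x) is the same for every state within a frame, so the log-ratio is enough for decoding. The function names, the `floor` parameter, and the toy alignment are hypothetical and are not taken from the paper.

```python
import math


def estimate_priors(alignment, num_states):
    """Estimate state priors P(s) as relative frequencies in a frame-level alignment.

    `alignment` is a list of state indices, one per training frame (toy input here).
    """
    counts = [0] * num_states
    for state in alignment:
        counts[state] += 1
    total = len(alignment)
    return [count / total for count in counts]


def posteriors_to_scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert one frame of posteriors P(s|x) into scaled log-likelihoods.

    Returns log P(s|x) - log P(s) for every state. The frame-dependent term
    log p(x) is dropped because it is constant across states.
    `floor` (an assumed safeguard) avoids log(0) for unseen or zero-probability states.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]


if __name__ == "__main__":
    # Toy alignment in which state 0 dominates the training data.
    priors = estimate_priors([0, 0, 0, 1, 1, 2, 0, 0], num_states=3)
    # Hypothetical network output for a single frame.
    frame_posteriors = [0.5, 0.3, 0.2]
    scaled = posteriors_to_scaled_log_likelihoods(frame_posteriors, priors)
    for state, value in enumerate(scaled):
        print(f"state {state}: scaled log-likelihood = {value:.4f}")
```

In this toy frame the network's most probable state is state 0, yet after division by the priors the rare state 2 receives the highest scaled likelihood. This is the usual motivation for the conversion: the HMM decoder already accounts for how frequent each state is, so the prior is removed from the network output to avoid counting it twice.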
