Hermitian Polynomial for Speaker Adaptation of Connectionist Speech Recognition Systems

Model adaptation techniques are an efficient way to reduce the mismatch that typically occurs between the training and test conditions of any automatic speech recognition (ASR) system. This work addresses the degradation in performance that occurs when moving from speaker-dependent (SD) to speaker-independent (SI) conditions for connectionist (or hybrid) hidden Markov model/artificial neural network (HMM/ANN) systems in the context of large vocabulary continuous speech recognition (LVCSR). Adapting hybrid HMM/ANN systems on a small amount of adaptation data has proven to be a difficult task and has been a limiting factor in the widespread deployment of hybrid techniques in operational ASR systems. Addressing the crucial issue of speaker adaptation (SA) for hybrid HMM/ANN systems can therefore have a great impact on the connectionist paradigm, which will play a major role in the design of next-generation LVCSR systems given the great success reported by deep neural networks - ANNs with many hidden layers that adopt the pre-training technique - on many speech tasks. Current adaptation techniques for ANNs, based on injecting an adaptable linear transformation network connected to either the input or the output layer, are not effective, especially with a small amount of adaptation data, e.g., a single adaptation utterance. In this paper, a novel solution is proposed to overcome those limits and make adaptation robust to scarce adaptation resources. The key idea is to adapt the hidden activation functions rather than the network weights. The adoption of Hermitian activation functions makes this possible. Experimental results on an LVCSR task demonstrate the effectiveness of the proposed approach.
