A region-specific feature-space transformation for speaker adaptation and singularity analysis of the Jacobian matrix

In this paper, we present an in-depth analysis of a recently proposed method for speaker adaptation that uses a region-specific feature-space transformation, which we refer to as soft R-FMLLR. We argue that the method has several difficulties, the most significant being that it is not invertible. We present an analysis of the singularity of the associated Jacobian matrix and show that the matrix becomes near-singular at certain points in the feature space, indicating that the transformation is non-invertible. We observe that in such cases maximum-likelihood estimation adversely affects speech recognition performance. Moreover, sufficient statistics do not exist, which makes the estimation procedure computationally very expensive. These concerns render the method unattractive. We propose a simple but important modification, hard R-FMLLR, and show that its Jacobian matrix is guaranteed to be full-rank and that its estimation is computationally efficient. On a large-vocabulary continuous speech recognition task, the proposed method outperforms soft R-FMLLR and is comparable to the widely used CMLLR with regression classes, especially when a larger number of transforms is used.
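
To make the singularity argument more concrete, the following minimal numerical sketch assumes the soft transform takes a posterior-weighted form y(x) = Σ_r γ_r(x)(A_r x + b_r); the softmax weights, the region centres, and all variable names here are illustrative assumptions on our part rather than the paper's actual parametrisation. The sketch evaluates the Jacobian of the soft transform at a point between two regions and compares its condition number with that of a hard-assignment variant, whose Jacobian is simply the selected A_r.

```python
import numpy as np

# Toy sketch of the Jacobian-conditioning argument, NOT the paper's exact
# formulation.  Assumed "soft" region-specific transform:
#     y(x) = sum_r gamma_r(x) * (A_r x + b_r),
# with x-dependent region weights gamma_r(x).  Its Jacobian,
#     J(x) = sum_r [ gamma_r(x) A_r + (A_r x + b_r) (d gamma_r / d x)^T ],
# contains rank-one weight-gradient terms that can drive it towards
# singularity.  A "hard" assignment uses a single region, so J(x) = A_r*,
# which is full-rank whenever every A_r is non-singular.

rng = np.random.default_rng(0)
d, R = 3, 2                                   # feature dimension, number of regions
A = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(R)]
b = [rng.standard_normal(d) for _ in range(R)]
mu = [rng.standard_normal(d) for _ in range(R)]  # illustrative region centres

def weights(x):
    """Soft region weights gamma_r(x): softmax of negative squared distances."""
    logits = np.array([-0.5 * np.sum((x - m) ** 2) for m in mu])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def weight_grads(x):
    """Gradients d gamma_r / d x of the softmax weights (R x d array)."""
    g = weights(x)
    dlogits = np.stack([-(x - m) for m in mu])            # d logits_r / d x
    return np.stack([g[r] * (dlogits[r] - g @ dlogits) for r in range(R)])

def soft_jacobian(x):
    """Jacobian of the soft region-weighted affine transform at x."""
    g, dg = weights(x), weight_grads(x)
    return sum(g[r] * A[r] + np.outer(A[r] @ x + b[r], dg[r]) for r in range(R))

x = 0.5 * (mu[0] + mu[1])                     # a point between two regions
J_soft = soft_jacobian(x)
J_hard = A[int(np.argmax(weights(x)))]        # hard assignment picks one region

# Large condition numbers flag (near-)singularity; the hard variant's
# Jacobian is just the selected A_r and stays well conditioned here.
print("cond(J_soft):", np.linalg.cond(J_soft))
print("cond(J_hard):", np.linalg.cond(J_hard))
```

The rank-one weight-gradient terms in the soft Jacobian are exactly what can push it towards singularity; under hard assignment those terms vanish, so full rank only requires each region's A_r to be non-singular.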
