Adaptation of front-end parameters in a speech recognizer

In this paper we consider the problem of adapting the parameters of the feature-extraction algorithm. Typical speech recognition systems extract features with a sequence of modules, and these features are then used for recognition. We present a method to adapt the parameters of these modules under a variety of criteria, e.g., maximum likelihood (ML) and maximum mutual information (MMI). The method assumes that the function each module implements is differentiable with respect to its inputs and parameters. We use this framework to optimize a linear transform preceding the linear discriminant analysis (LDA) matrix and show that, with small amounts of data, it gives significantly better performance than a linear transform applied after the LDA matrix. We also show that linear transforms can be estimated by directly optimizing the likelihood or the MMI objective, without using auxiliary functions. Finally, we apply the method to optimize the Mel bins and, in a system that uses power-law compression, the compression power.
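The key requirement above is that every module be differentiable with respect to its inputs and parameters, so the gradient of the objective with respect to any front-end parameter follows from the chain rule. The following is a minimal sketch under stated assumptions, not the paper's implementation. It uses a toy front end of three hypothetical modules: power-law compression of positive filterbank energies with exponent `alpha`, a trainable square linear transform `A`, and a fixed random matrix `P` standing in for the LDA matrix. The acoustic model is assumed to be a single diagonal Gaussian. All names and sizes are illustrative. The hand-written backward pass is checked against central finite differences.

```python
import numpy as np


def front_end(s, alpha, A, P):
    """Forward pass through three hypothetical front-end modules."""
    y = s ** alpha   # module 1: power-law compression (parameter alpha)
    u = A @ y        # module 2: trainable linear transform preceding P
    z = P @ u        # module 3: fixed projection standing in for LDA
    return y, u, z


def log_likelihood(z, mu, var):
    """Log-density of z under a diagonal Gaussian N(mu, diag(var))."""
    return -0.5 * np.sum((z - mu) ** 2 / var + np.log(2.0 * np.pi * var))


def objective(s, alpha, A, P, mu, var):
    return log_likelihood(front_end(s, alpha, A, P)[2], mu, var)


def gradients(s, alpha, A, P, mu, var):
    """Objective and its gradients w.r.t. alpha and A via the chain rule."""
    y, u, z = front_end(s, alpha, A, P)
    obj = log_likelihood(z, mu, var)
    g_z = -(z - mu) / var                  # dObj/dz
    g_u = P.T @ g_z                        # back through the fixed projection
    g_A = np.outer(g_u, y)                 # dObj/dA, since u = A y
    g_y = A.T @ g_u                        # back through the linear transform
    g_alpha = np.sum(g_y * y * np.log(s))  # d(s**alpha)/dalpha = s**alpha * ln(s)
    return obj, g_alpha, g_A


# Toy problem with illustrative sizes and random data.
rng = np.random.default_rng(0)
n, d = 6, 3
s = rng.uniform(0.5, 2.0, size=n)      # positive "filterbank energies"
alpha = 0.3
A = np.eye(n)                          # start the transform at identity
P = rng.standard_normal((d, n))
mu = rng.standard_normal(d)
var = rng.uniform(0.5, 1.5, size=d)

obj, g_alpha, g_A = gradients(s, alpha, A, P, mu, var)

# Central finite differences as an independent check of the backward pass.
eps = 1e-6
num_alpha = (objective(s, alpha + eps, A, P, mu, var)
             - objective(s, alpha - eps, A, P, mu, var)) / (2 * eps)
num_A = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        dA = np.zeros_like(A)
        dA[i, j] = eps
        num_A[i, j] = (objective(s, alpha, A + dA, P, mu, var)
                       - objective(s, alpha, A - dA, P, mu, var)) / (2 * eps)

print("alpha gradient error:", abs(num_alpha - g_alpha))
print("A gradient max error:", np.max(np.abs(num_A - g_A)))
```

The sketch only demonstrates gradient computation. A complete ML objective over feature-space parameters would typically also need a Jacobian (volume) correction, for example a log|det A| term for a square transform applied to the features, so that the likelihood cannot be raised simply by shrinking the features. Gradients of this kind can then be passed to a quasi-Newton optimizer such as L-BFGS.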
