Scaled norm-based Euclidean projection for sparse speaker adaptation

To reduce data storage for speaker adaptive (SA) models, in our previous work, we proposed a sparse speaker adaptation method which can efficiently reduce the number of adapted parameters by using Euclidean projection onto the L1-ball (EPL1) while maintaining recognition performance comparable to maximum a posteriori (MAP) adaptation. In the EPL1-based sparse speaker adaptation framework, however, the adapted Gaussian mean vectors are mostly concentrated on dimensions having large variances because of assuming unit variance for all dimensions. To make EPL1 more flexible, in this paper, we propose scaled norm-based Euclidean projection (SNEP) which can consider dimension-specific variances. By using SNEP, we also propose a new sparse speaker adaptation method which can consider the variances of a speaker-independent model. Our experiments show that the adapted components of mean vectors are evenly distributed in all dimensions, and we can obtain sparsely adapted models with no loss of phone recognition performance from the proposed method compared with MAP adaptation.

[1]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[2]  Hoirin Kim,et al.  Constrained MLE-based speaker adaptation with L1 regularization , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Jun Liu,et al.  Efficient Euclidean projections in linear time , 2009, ICML '09.

[4]  Y. Nesterov A method for solving the convex programming problem with convergence rate O(1/k^2) , 1983 .

[5]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[6]  R. Tibshirani,et al.  Regression shrinkage and selection via the lasso: a retrospective , 2011 .

[7]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[8]  Patrick Kenny,et al.  Eigenvoice modeling with sparse training data , 2005, IEEE Transactions on Speech and Audio Processing.

[9]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[11]  Jing Huang,et al.  Affine invariant sparse maximum a posteriori adaptation , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Yurii Nesterov,et al.  Smooth minimization of non-smooth functions , 2005, Math. Program..

[13]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[14]  Mário A. T. Figueiredo,et al.  Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems , 2007, IEEE Journal of Selected Topics in Signal Processing.

[15]  S. Kotz,et al.  The Laplace Distribution and Generalizations , 2012 .

[16]  Changshui Zhang,et al.  Efficient Euclidean projections via Piecewise Root Finding and its application in gradient projection , 2011, Neurocomputing.

[17]  Jing Huang,et al.  Sparse Maximum A Posteriori adaptation , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[18]  Yoram Singer,et al.  Efficient Online and Batch Learning Using Forward Backward Splitting , 2009, J. Mach. Learn. Res..

[19]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[20]  Yurii Nesterov,et al.  Gradient methods for minimizing composite functions , 2012, Mathematical Programming.

[21]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[22]  Hank Liao,et al.  Speaker adaptation of context dependent deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[23]  Roland Kuhn,et al.  Rapid speaker adaptation in eigenvoice space , 2000, IEEE Trans. Speech Audio Process..

[24]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[25]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[26]  Yoram Singer,et al.  Efficient projections onto the l1-ball for learning in high dimensions , 2008, ICML '08.

[27]  Dimitri P. Bertsekas,et al.  Nonlinear Programming , 1997 .