Low-rank and sparse subspace modeling of speech for DNN based acoustic modeling

Abstract: Towards the goal of improving acoustic modeling for automatic speech recognition (ASR), this work investigates the modeling of senone subspaces in deep neural network (DNN) posteriors using low-rank and sparse modeling approaches. While DNN posteriors are typically very high-dimensional, recent studies have shown that the true class information is embedded in low-dimensional subspaces. Thus, a matrix of all posteriors belonging to a particular senone class is expected to have a very low rank. In this paper, we exploit Principal Component Analysis (PCA) for low-rank modeling and Compressive Sensing based dictionary learning for sparse modeling of senone subspaces. Our hypothesis is that the principal components of the DNN posterior space (termed eigen-posteriors in this work) and Compressive Sensing dictionaries can act as suitable models to extract the well-structured low-dimensional latent information and discard the undesirable high-dimensional unstructured noise present in the posteriors. The enhanced DNN posteriors thus obtained are used as soft targets for training better acoustic models to improve ASR. In this context, our approach also enables improving distant speech recognition by mapping far-field acoustic features to low-dimensional senone subspaces learned from near-field features. Experiments are performed on the AMI Meeting corpus in both close-talk (IHM) and far-field (SDM) microphone settings, where acoustic models trained using enhanced DNN posteriors outperform the conventional hard-target based hybrid DNN-HMM systems. An information theoretic analysis is also presented to show how low-rank and sparse enhancement modifies the DNN posterior space to better match the assumptions of the hidden Markov model (HMM) backend.
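The low-rank enhancement idea described above can be illustrated with a minimal sketch: collect the posteriors of one senone class into a matrix, project them onto the leading principal components, and renormalize the reconstruction into probabilities. The abstract does not specify the details, so the function name `lowrank_enhance`, the use of log-posteriors, the 95% retained-variance threshold, and the softmax renormalization are all assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def lowrank_enhance(posteriors, var_keep=0.95, eps=1e-10):
    """Project one senone class's posteriors onto its top principal components.

    posteriors: (n_frames, n_senones) array whose rows sum to 1.
    var_keep:   fraction of variance to retain (assumed value, not from the paper).
    Returns the enhanced posteriors (same shape) and the number of components kept.
    """
    # Work in the log domain (an assumption) and center the data.
    logp = np.log(posteriors + eps)
    mean = logp.mean(axis=0, keepdims=True)
    centered = logp - mean

    # Rows of vt are the principal directions ("eigen-posteriors").
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2
    total = var.sum()
    if total == 0.0:
        rank = 1  # degenerate case: all frames identical
    else:
        cum = np.cumsum(var) / total
        # Smallest number of components whose cumulative variance reaches var_keep.
        rank = min(int(np.searchsorted(cum, var_keep)) + 1, len(s))

    # Low-rank reconstruction: project onto the retained directions and back.
    basis = vt[:rank]
    recon = centered @ basis.T @ basis + mean

    # Softmax so that each row is a valid probability distribution again.
    recon = recon - recon.max(axis=1, keepdims=True)
    enhanced = np.exp(recon)
    enhanced /= enhanced.sum(axis=1, keepdims=True)
    return enhanced, rank
```

In a pipeline of the kind the abstract describes, the returned `enhanced` matrix would play the role of soft targets for retraining the acoustic model. The sparse counterpart would replace the PCA projection with a sparse reconstruction over a learned dictionary.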
