Manifold regularized deep neural networks

Deep neural networks (DNNs) have been successfully applied to a variety of automatic speech recognition (ASR) tasks, both in discriminative feature extraction and in hybrid acoustic modeling scenarios. The development of improved loss functions and regularization approaches has resulted in consistent reductions in ASR word error rates (WERs). This paper presents a manifold-learning-based regularization framework for DNN training. The associated techniques attempt to preserve the underlying low-dimensional manifold-based relationships among speech feature vectors as part of the optimization procedure for estimating network parameters. This is achieved by imposing manifold-based locality-preserving constraints on the outputs of the network. The techniques are presented in the context of a bottleneck DNN architecture for feature extraction in a tandem configuration. The ASR WER obtained using these networks is evaluated on a speech-in-noise task and compared to that obtained using DNN-bottleneck networks trained without manifold constraints.

Index Terms: manifold learning, deep neural networks, speech recognition, tandem feature extraction
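The abstract does not state the training objective, so the following is only a minimal sketch of how a locality-preserving constraint on network outputs could be added to a standard loss. It assumes a Gaussian-kernel affinity over each input's k nearest neighbours and a graph-Laplacian-style penalty, the sum over i, j of w_ij ||y_i - y_j||^2, weighted by a hypothetical coefficient `gamma`. All function names, the kernel, and the normalization are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def knn_affinity(X, k=2, sigma=1.0):
    """Gaussian-kernel affinities, kept only for each point's k nearest neighbours.

    Assumption: the kernel form and the values of k and sigma are illustrative.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances between input feature vectors.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(sq[i])
        neighbours = [j for j in order if j != i][:k]
        W[i, neighbours] = np.exp(-sq[i, neighbours] / sigma)
    return W


def manifold_penalty(Y, W):
    """Sum of w_ij * ||y_i - y_j||^2 over all pairs of network outputs."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return float(np.sum(W * sq))


def cross_entropy(P, labels):
    """Mean negative log-probability of the correct class."""
    return float(-np.mean(np.log(P[np.arange(len(labels)), labels])))


def regularized_loss(P, labels, Y, W, gamma=0.1):
    """Cross-entropy plus a gamma-weighted locality-preserving penalty (hypothetical form)."""
    return cross_entropy(P, labels) + gamma * manifold_penalty(Y, W) / len(labels)
```

In this sketch the penalty is small when inputs that are close in feature space are also mapped to nearby outputs, which is the sense in which the constraint preserves local relationships. Setting `gamma` to zero recovers the unregularized loss, corresponding to the baseline network trained without manifold constraints.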
