Sparse Multilayer Perceptron for Phoneme Recognition

This paper introduces the sparse multilayer perceptron (SMLP), which jointly learns a sparse feature representation and nonlinear classifier boundaries to optimally discriminate multiple output classes. As in a conventional multilayer perceptron (MLP), the SMLP learns the transformation from inputs to targets, but the outputs of one of its internal hidden layers are forced to be sparse. This is achieved by adding a sparse regularization term to the cross-entropy cost and updating the network parameters to minimize the joint cost. On the TIMIT phoneme recognition task, SMLP-based systems trained on individual speech recognition feature streams perform significantly better than the corresponding MLP-based systems. A phoneme error rate of 19.6% is achieved using the combination of SMLP-based systems, a relative improvement of 3.0% over the combination of MLP-based systems.
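The joint cost described above can be illustrated with a minimal sketch. The abstract does not specify the form of the sparse regularization term, so the L1 penalty used here, the weight `lam`, and the function names `softmax` and `joint_cost` are illustrative assumptions rather than the paper's actual formulation.

```python
import math


def softmax(logits):
    """Convert output-layer activations into class posteriors."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_cost(hidden, logits, target, lam=0.1):
    """Cross-entropy plus a weighted sparsity penalty on one hidden layer.

    hidden : outputs of the hidden layer that is pushed toward sparsity
    logits : output-layer activations before the softmax
    target : index of the correct class
    lam    : regularization weight (hypothetical value, not from the paper)
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    # Assumed penalty: L1 norm of the hidden outputs. The paper's
    # regularizer may differ; any sparsity-inducing term fits here.
    sparsity = sum(abs(h) for h in hidden)
    return cross_entropy + lam * sparsity


logits = [2.0, 0.5, -1.0]
dense_hidden = [0.5, -0.5, 0.5, -0.5]
sparse_hidden = [1.0, 0.0, 0.0, 0.0]

cost_dense = joint_cost(dense_hidden, logits, target=0)
cost_sparse = joint_cost(sparse_hidden, logits, target=0)
```

With identical output activations, the sparser hidden vector yields the lower joint cost, so gradient updates that minimize this cost favor sparse hidden outputs while still reducing the classification error.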
