Parallel and Hierarchical Decision Making for Sparse Coding in Speech Recognition

Sparse coding exhibits promising performance in speech processing, mainly due to the large number of bases that can be used to represent speech signals. However, its high computational cost is a major obstacle for large datasets, as is the difficulty of exploiting information scattered sparsely across high-dimensional features. This paper reports the use of an online dictionary learning technique, proposed recently by the machine learning community, to learn large-scale bases efficiently, and proposes a new parallel and hierarchical architecture to make use of the sparse information in high-dimensional features. The approach uses multilayer perceptrons (MLPs) to model sparse feature subspaces and make local decisions accordingly; these local decisions are then integrated hierarchically by additional MLPs to form global decisions. Experiments on the WSJ database show that the proposed approach not only avoids the prohibitive computation associated with large-dimensional sparse features, but also provides better performance on a frame-level phone prediction task.
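The parallel-and-hierarchical idea described above can be sketched as a forward pass: the high-dimensional sparse code is partitioned into subspaces, a small MLP produces phone posteriors (a local decision) for each subspace in parallel, and a further MLP merges those posteriors into a global decision. The sketch below is a minimal NumPy illustration under assumed, illustrative sizes (a 4000-dimensional code, 8 equal subspaces, 40 phone classes, untrained random weights); it is not the paper's exact configuration or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP with a softmax output (phone-posterior estimates)."""
    h = np.tanh(x @ W1 + b1)
    z = h @ W2 + b2
    e = np.exp(z - z.max())          # stable softmax
    return e / e.sum()

def init_mlp(d_in, d_hid, d_out):
    """Random (untrained) parameters; in the paper these would be trained."""
    return (rng.standard_normal((d_in, d_hid)) * 0.1, np.zeros(d_hid),
            rng.standard_normal((d_hid, d_out)) * 0.1, np.zeros(d_out))

# Hypothetical sizes, chosen only for illustration.
D, S, C = 4000, 8, 40                # code dim, number of subspaces, phone classes
sub = D // S                         # dimension of each feature subspace

local_mlps = [init_mlp(sub, 64, C) for _ in range(S)]   # parallel local models
merger = init_mlp(S * C, 128, C)                        # hierarchical merger MLP

def predict(sparse_code):
    """Local decisions per subspace, then a global decision over them."""
    local_posts = [mlp_forward(sparse_code[i * sub:(i + 1) * sub], *local_mlps[i])
                   for i in range(S)]
    return mlp_forward(np.concatenate(local_posts), *merger)

x = np.maximum(rng.standard_normal(D) - 2.0, 0.0)   # mostly-zero "sparse" code
p = predict(x)                                      # global phone posteriors
```

Each local MLP only ever sees its own low-dimensional slice, which is what keeps the computation tractable compared with a single MLP over the full sparse vector.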
