Improved automatic speech recognition system using sparse decomposition by basis pursuit with deep rectifier neural networks and compressed sensing recomposition of speech signals

Research on the common limitations of Automatic Speech Recognition (ASR) systems points to problems ranging from environmental noise and channel or speaker variability to the constraints imposed by the measurement device. In mobile applications of automatic speech recognition, the Nyquist criterion constrains the sampling rate at which a device can acquire the signal, and the resulting lack of fidelity often degrades recognition performance. This is a particular problem for mobile devices, which are nowadays also the prime beneficiaries of speech recognition applications, because their sampling rate is limited. We propose a way to get the most out of any acquired signal by using sparse decomposition algorithms and compressed sensing recomposition. We start from the observation that complex sounds can be viewed as an overlap of sounds produced by simple sparse sources. We therefore decompose the measured signal into a linear combination of simple sparse signals and reconstruct each sparse signal by means of compressed sensing recomposition in order to obtain better signal fidelity. A deep rectifier neural network is used to decompose a training set of signals and to compute a dictionary of simple sparse signals. The resulting dictionary is used to decompose the acquired signal with sparse (basis pursuit) algorithms, and the resulting combination of sparse signals is then used for signal reconstruction in a compressed sensing algorithm. We test the framework on different simulated speech signals, evaluate its usability in automatic speech recognition, and discuss the improvements this upgrade brings to an ASR system. In this paper we describe the framework and the algorithms used and present the experimental results.
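To make the pipeline concrete, the sketch below illustrates the two numerical steps the abstract describes: sparse decomposition of the acquired measurements over a dictionary (here solved by an iterative soft-thresholding relaxation of basis pursuit) and compressed sensing recomposition of the full-rate frame from the recovered sparse code. It is a minimal illustration under stated assumptions, not the paper's implementation: the dictionary `D` is random rather than learned by the deep rectifier network, the frame `s` is synthetic, and all names (`ista`, `Phi`, `code_hat`) are hypothetical.

```python
import numpy as np

def ista(A, y, lam=1e-3, n_iter=2000):
    """Iterative soft-thresholding for min_x 0.5*||A x - y||^2 + lam*||x||_1,
    a standard convex-relaxation surrogate for basis pursuit decomposition."""
    L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                 # gradient of the quadratic term
        x = x - grad / L
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)   # soft threshold
    return x

rng = np.random.default_rng(0)
n, k, m = 256, 512, 96        # frame length, dictionary size, number of CS measurements

# Stand-ins for the learned components: a random overcomplete dictionary and a
# synthetic frame that is a 3-sparse combination of its atoms.
D = rng.standard_normal((n, k)) / np.sqrt(n)
code_true = np.zeros(k)
code_true[rng.choice(k, 3, replace=False)] = rng.standard_normal(3)
s = D @ code_true

# Compressed sensing acquisition: m << n random measurements of the frame.
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
y = Phi @ s

# Sparse decomposition of the measurements over the sensed dictionary Phi @ D,
# followed by recomposition of the full-rate frame from the recovered code.
code_hat = ista(Phi @ D, y)
s_hat = D @ code_hat

print("relative reconstruction error:", np.linalg.norm(s - s_hat) / np.linalg.norm(s))
```

In the framework proposed in the paper, the random dictionary above would be replaced by the dictionary of simple sparse signals computed by the deep rectifier network, and the reconstructed frames would feed the ASR front end.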
