Using sparse representations for missing data imputation in noise robust speech recognition

Noise robustness of automatic speech recognition benefits from missing data imputation: prior to recognition, the parts of the spectrogram dominated by noise are replaced by clean speech estimates. At low SNRs in particular, each frame contains at best only a few uncorrupted coefficients. This makes frame-by-frame restoration of corrupted feature vectors error-prone, and recognition accuracy is then mostly sub-optimal. In this paper we present a novel imputation technique that works on entire words. A word is sparsely represented in an overcomplete basis of exemplar (clean) speech signals, using only the uncorrupted time-frequency elements of the word. The corrupted elements are replaced by estimates obtained by projecting the sparse representation in the basis. Using oracle masks on AURORA-2, we achieve a recognition accuracy of 92% at an SNR of -5 dB, compared to 61% for a conventional frame-based approach. The performance obtained with estimated masks can be directly related to the proportion of correctly identified uncorrupted coefficients.
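The imputation step described above can be illustrated with a minimal sketch. It assumes a word has been flattened into a single vector of time-frequency elements, that a binary mask marks the uncorrupted elements, and that the exemplars form the columns of a matrix. The function names (`omp`, `impute`), the synthetic Gaussian data, and the use of greedy orthogonal matching pursuit as the sparse solver are assumptions made for illustration only; the abstract does not state which solver or feature representation the paper uses.

```python
import numpy as np

def omp(A, y, n_nonzero):
    """Greedy orthogonal matching pursuit: find a sparse x with A @ x close to y.

    Illustrative stand-in for a sparse solver; not necessarily the paper's choice.
    """
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    norms = np.linalg.norm(A, axis=0) + 1e-12
    for _ in range(n_nonzero):
        # Pick the exemplar most correlated with what is still unexplained.
        corr = np.abs(A.T @ residual) / norms
        if support:
            corr[support] = 0.0
        support.append(int(np.argmax(corr)))
        # Refit all selected exemplars jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

def impute(noisy, mask, exemplars, n_nonzero=5):
    """Replace unreliable elements of `noisy` using a sparse exemplar fit.

    noisy:     (d,) flattened time-frequency representation of one word
    mask:      (d,) boolean, True where the element is uncorrupted
    exemplars: (d, n) matrix whose columns are clean exemplar words
    """
    reliable = mask.astype(bool)
    # Fit the sparse representation on the reliable rows only.
    x = omp(exemplars[reliable], noisy[reliable], n_nonzero)
    # Project back through the full basis to estimate every element.
    estimate = exemplars @ x
    # Keep reliable observations, fill in the rest from the estimate.
    return np.where(reliable, noisy, estimate)

# Toy usage on synthetic data (not real speech features).
rng = np.random.default_rng(0)
exemplars = rng.standard_normal((200, 50))           # 50 exemplars, 200 elements each
clean = 0.7 * exemplars[:, 3] + 0.3 * exemplars[:, 17]
mask = (np.arange(200) % 5) < 2                       # 40% of elements reliable
noise = 5.0 * np.abs(rng.standard_normal(200))
noisy = np.where(mask, clean, clean + noise)          # oracle mask: noise only where unreliable
restored = impute(noisy, mask, exemplars)
print(np.max(np.abs(restored - clean)))
```

In this toy setting the mask plays the role of an oracle mask: it is known exactly which elements are corrupted. With an estimated mask, some corrupted elements would enter the fit and some clean ones would be discarded, which is one way to read the abstract's remark that performance tracks the proportion of correctly identified uncorrupted coefficients.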
