Sparse imputation for large vocabulary noise robust ASR

An effective way to increase noise robustness in automatic speech recognition is to label the noisy speech features as either reliable or unreliable ('missing'), and to replace ('impute') the missing ones with clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low SNRs, frame-based imputation techniques fail because many time frames contain few, if any, reliable features. In previous work, we introduced an exemplar-based method, dubbed sparse imputation, which can impute missing features using reliable features from neighbouring frames. We achieved substantial gains in performance at low SNRs for a connected digit recognition task. In this work, we investigate whether the exemplar-based approach can be generalised to a large vocabulary task. Experiments on artificially corrupted speech show that sparse imputation substantially outperforms a conventional imputation technique when the ideal 'oracle' reliability of features is used. With error-prone estimates of feature reliability, sparse imputation performance is comparable to that of our baseline imputation technique in the cleanest conditions, and substantially better at lower SNRs. With noisy speech recorded in realistic noise conditions, sparse imputation performs slightly worse than our baseline imputation technique in the cleanest conditions, but substantially better in the noisier conditions.
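To make the reconstruction step concrete, the sketch below illustrates exemplar-based sparse imputation on a single fixed-length spectrographic window: the unreliable ('missing') time-frequency cells are re-estimated from a sparse, non-negative linear combination of clean-speech exemplars fitted to the reliable cells only. This is a minimal sketch, not the solver used in our experiments; the names (sparse_impute, exemplars, mask) are hypothetical, scikit-learn's Lasso stands in for an l1-regularised non-negative solver, and Python with numpy and scikit-learn is assumed.

import numpy as np
from sklearn.linear_model import Lasso

def sparse_impute(noisy_window, mask, exemplars, alpha=0.01):
    # noisy_window: (d,) flattened spectrographic window of noisy speech
    # mask:         (d,) boolean array, True where a feature is deemed reliable
    # exemplars:    (d, n) dictionary of flattened clean-speech windows
    # Fit a sparse, non-negative combination of clean exemplars to the
    # reliable entries of the observation only.
    solver = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=10000)
    solver.fit(exemplars[mask, :], noisy_window[mask])
    reconstruction = exemplars @ solver.coef_
    # Keep the reliable observations; impute the unreliable ones.
    return np.where(mask, noisy_window, reconstruction)

# Toy usage with synthetic data (illustrative only).
rng = np.random.default_rng(0)
D = np.abs(rng.normal(size=(230, 200)))       # e.g. 23 mel bands x 10 frames, 200 exemplars
clean = D[:, :3] @ np.array([0.6, 0.3, 0.1])  # a window that truly is a sparse mixture
mask = rng.random(230) > 0.7                  # only ~30% of the features reliable (low SNR)
noisy = np.where(mask, clean, clean + 5.0)    # corrupt the unreliable cells
imputed = sparse_impute(noisy, mask, D)
print(float(np.mean(np.abs(imputed[~mask] - clean[~mask]))))

For whole utterances, such a reconstruction would typically be applied to overlapping windows sliding over the noisy speech, with the overlapping estimates combined; that bookkeeping is omitted here.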
