Bounded Conditional Mean Imputation with Observation Uncertainties and Acoustic Model Adaptation

Automatic speech recognition systems use noise compensation and acoustic model adaptation to increase robustness to speaker and environmental variation. The current work focuses on noise compensation with bounded conditional mean imputation (BCMI). BCMI approaches are missing-data methods that assume noise-corrupted observations can be divided into reliable and unreliable components. BCMI methods replace the unreliable components with a clean speech posterior distribution. The posterior means can be used as clean speech estimates, and the posterior variances can be introduced into the acoustic model likelihood calculation as observation uncertainties. In addition, we propose introducing similar uncertainties into acoustic model adaptation. Evaluation with speech data recorded in diverse public and car environments indicates that the proposed uncertainties improve adaptation performance. When uncertainties were used in both acoustic model likelihood calculation and adaptation, the proposed imputation and adaptation system achieved 15%-84% relative error reductions compared with an uncompensated baseline system.
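The imputation and uncertainty steps described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions that the abstract does not state: the clean speech prior is a single Gaussian with diagonal covariance (rather than a mixture model), each unreliable component is bounded from above by the noisy observation, and the reliability mask is given. The function names `bounded_posterior`, `impute`, and `log_likelihood` are hypothetical.

```python
import math


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bounded_posterior(mu, var, upper):
    """Mean and variance of N(mu, var) truncated to (-inf, upper].

    Assumption: the clean speech value cannot exceed the noisy
    observation, so the observation acts as the upper bound.
    """
    sigma = math.sqrt(var)
    beta = (upper - mu) / sigma
    ratio = _pdf(beta) / _cdf(beta)
    mean = mu - sigma * ratio
    variance = var * (1.0 - beta * ratio - ratio * ratio)
    return mean, variance


def impute(obs, reliable, prior_mu, prior_var):
    """Return clean speech estimates and observation uncertainties.

    Reliable components are kept as observed with zero uncertainty;
    unreliable ones are replaced by the bounded posterior mean, and
    the posterior variance is kept as the uncertainty.
    """
    estimates, uncertainties = [], []
    for y, r, m, v in zip(obs, reliable, prior_mu, prior_var):
        if r:
            estimates.append(y)
            uncertainties.append(0.0)
        else:
            post_mean, post_var = bounded_posterior(m, v, y)
            estimates.append(post_mean)
            uncertainties.append(post_var)
    return estimates, uncertainties


def log_likelihood(x, unc, model_mu, model_var):
    """Diagonal Gaussian log-likelihood with the observation
    uncertainty added to the acoustic model variance."""
    total = 0.0
    for xi, ui, m, v in zip(x, unc, model_mu, model_var):
        s = v + ui
        total += -0.5 * (math.log(2.0 * math.pi * s) + (xi - m) ** 2 / s)
    return total
```

Under these assumptions, an unreliable component always receives an estimate below the observed value and an uncertainty smaller than the prior variance, while reliable components pass through unchanged. A fuller treatment would condition on the reliable components and sum over mixture components.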
