Background of robust speech recognition

In this chapter, we establish the fundamental concepts most relevant to the remainder of this book. We first model the distortion of speech in acoustic environments. We then show how noise affects Gaussian models and explain why deep neural network (DNN) models are more robust to noise. Next, we formulate a general framework for noise-robust automatic speech recognition (ASR), which serves as the foundation for the wide variety of techniques presented in later chapters. Based on this framework, we provide a comprehensive, mathematically rigorous, and unified overview of noise-robust ASR, using five ways of categorizing, analyzing, and characterizing the major existing techniques: (1) feature-domain vs. model-domain processing, (2) the use of prior knowledge about the acoustic environment distortion, (3) the use of explicit environment-distortion models, (4) deterministic vs. uncertainty processing, and (5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the test stage. The following five chapters present and analyze representative methods according to each of these five categorizations.
