Comparing human and automatic speech recognition in a perceptual restoration experiment

Highlights: Missing-data methods are evaluated in a perceptual restoration task. Human and automatic speech recognition performance are compared. Methods include a novel approach to cepstral-domain bounded marginalisation.

Speech that has been distorted by introducing spectral or temporal gaps is still perceived as continuous and complete by human listeners, so long as the gaps are filled with additive noise of sufficient intensity. When such perceptual restoration occurs, the speech is also more intelligible than when no noise has been added in the gaps. This observation has motivated so-called 'missing data' systems for automatic speech recognition (ASR), but there have been few attempts to determine whether such systems are a good model of perceptual restoration in human listeners. Accordingly, the current paper evaluates missing data ASR in a perceptual restoration task. We evaluated two systems that use a new approach to bounded marginalisation in the cepstral domain, and a bounded conditional mean imputation method. Both methods model the available speech information as a clean-speech posterior distribution that is subsequently passed to an ASR system. The proposed missing data ASR systems were evaluated using distorted speech, in which spectro-temporal gaps were optionally filled with additive noise. The speech recognition performance of the proposed systems was compared with that of a baseline ASR system and with human speech recognition performance on the same task. We conclude that missing data methods improve speech recognition performance in a manner that is consistent with perceptual restoration in human listeners.
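To make the idea of bounded marginalisation concrete, the following is a minimal sketch of the classical spectral-domain formulation for a single diagonal-covariance Gaussian. It is not the cepstral-domain approach proposed in the paper, whose details are not given in this abstract. The function names, the `reliable` mask, and the lower bound of zero are assumptions made for illustration. Reliable components are scored with the Gaussian density, while each unreliable component contributes the probability mass between the lower bound and the observed value, which acts as an upper bound on the clean-speech energy. Some formulations also divide this mass by the width of the interval; that normalisation is omitted here.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gaussian_cdf(x, mean, var):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def bounded_marginal_log_likelihood(obs, reliable, means, variances,
                                    lower=0.0, floor=1e-300):
    """Log-likelihood of one frame under a diagonal Gaussian (illustrative).

    obs       -- observed spectral features for the frame
    reliable  -- mask; True where the feature is dominated by speech
    lower     -- assumed lower bound on clean-speech energy (hypothetical)
    floor     -- guards against log(0) when the bounded mass underflows
    """
    log_lik = 0.0
    for x, ok, mu, var in zip(obs, reliable, means, variances):
        if ok:
            # Reliable component: use the density at the observed value.
            log_lik += gaussian_log_pdf(x, mu, var)
        else:
            # Unreliable component: integrate the density over [lower, x].
            mass = gaussian_cdf(x, mu, var) - gaussian_cdf(lower, mu, var)
            log_lik += math.log(max(mass, floor))
    return log_lik
```

Under this sketch, a noise-filled gap yields a high upper bound, so models whose means lie below the observed level remain plausible. A silent gap yields a bound near zero and penalises any model that expects speech energy there, which is one way to read the link between missing data ASR and perceptual restoration that the abstract describes.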
