Distant microphone speech recognition in a noisy indoor environment: combining soft missing data and speech fragment decoding

This paper examines the problem of distant microphone speech recognition in noisy indoor home environments. The noise background can be roughly characterised as a slowly varying noise floor in which a mixture of energetic but unpredictable acoustic events is embedded. Our solution combines two complementary techniques. First, a soft missing data mask is formed that estimates the degree to which energetic acoustic events are masked by the noise floor; this step relies on a simple adaptive noise model. Second, a speech fragment decoding system attempts to interpret the energetic regions that the noise floor model does not account for. This component uses models of the target speech to decide whether fragments (time-frequency regions dominated by a single sound source) should be included in the target speech stream. The combined approach modestly outperforms speech fragment decoding without an adaptive noise floor. Our experiments also show that speech fragment decoding performs far better than soft missing data decoding in variable noise, achieving 73% keyword recognition accuracy at -6 dB SNR on the Grid corpus task and substantially outperforming multicondition training.
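
To make the first step concrete, the sketch below shows one common way to realise a soft missing data mask: a per-channel adaptive noise-floor tracker followed by a sigmoid mapping from local SNR to a reliability value in [0, 1]. This is a minimal Python sketch under stated assumptions, not the paper's exact estimator: the slow-rise/fast-fall floor tracker, the function names, and the sigmoid parameters (`theta_db`, `slope`) are illustrative choices.

```python
import numpy as np


def adaptive_noise_floor(spec, rise=1.02, fall=0.98):
    """Slow-rise / fast-fall tracker for a per-channel noise floor.

    spec: (frames, channels) array of filterbank energies.
    When a frame's energy exceeds the current estimate, the floor creeps
    up by a small multiplicative factor, so brief energetic events do not
    capture it; when energy drops below, the floor relaxes toward the
    observation, following the slowly varying background.
    """
    spec = np.asarray(spec, dtype=float)
    floor = np.empty_like(spec)
    est = spec[0].copy()
    for t, frame in enumerate(spec):
        est = np.where(frame > est, est * rise, fall * est + (1.0 - fall) * frame)
        floor[t] = est
    return floor


def soft_mask(spec, floor, theta_db=3.0, slope=0.5):
    """Map local SNR against the noise floor to a soft mask in [0, 1].

    Cells well above the floor get values near 1 (likely dominated by an
    energetic acoustic event rather than the background); cells near or
    below the floor get values near 0. theta_db sets the sigmoid centre
    and slope its sharpness.
    """
    spec = np.asarray(spec, dtype=float)
    snr_db = 10.0 * np.log10(np.maximum(spec, 1e-12) / np.maximum(floor, 1e-12))
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - theta_db)))


# Example: a hypothetical 2-second, 32-channel energy spectrogram
# at 100 frames per second.
spec = np.abs(np.random.randn(200, 32)) + 1.0
mask = soft_mask(spec, adaptive_noise_floor(spec))
```

In the pipeline the paper describes, a mask of this kind would feed the soft missing data decoder, while the energetic regions the floor model cannot explain are handed to the speech fragment decoder for foreground/background assignment.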
