Recent advances in fragment-based speech recognition in reverberant multisource environments

This paper addresses the problem of speech recognition using distant binaural microphones in reverberant multisource noise conditions. Our scheme employs a two-stage fragment decoding approach: first, spectro-temporal acoustic source fragments are identified using signal-level cues; second, a hypothesis-driven stage simultaneously searches for the most probable speech/background fragment labelling and the corresponding acoustic model state sequence. The paper reports recent advances in combining adaptive noise floor modelling and binaural localisation cues within this framework. The decoder derives significant recognition performance benefits from both noise floor tracking and fragment location estimates. Using models trained on noise-free speech, the system achieves an average keyword recognition accuracy of 80.60% on the final test set of the PASCAL CHiME Challenge task.
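
The second, hypothesis-driven stage can be illustrated with a toy sketch. The code below is not the paper's decoder: it assumes a single frame, a single diagonal-Gaussian speech state, and an exhaustive search over fragment labellings, and all function names (`log_gauss`, `log_cdf`, `score_labelling`, `best_labelling`) are hypothetical. Channels in fragments labelled as speech are scored with the Gaussian density; channels in fragments labelled as background are scored with the probability that the masked speech energy lies at or below the observation. A real decoder would also search over state sequences and would need to handle the scaling between the two kinds of term more carefully.

```python
import math
from itertools import product

def log_gauss(x, mu, sigma):
    """Log density of a 1-D Gaussian: used when the channel is labelled speech."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def log_cdf(x, mu, sigma):
    """Log P(speech <= x): used when the channel is labelled background,
    i.e. the speech is assumed to be masked by a louder source."""
    p = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return math.log(max(p, 1e-300))  # floor avoids log(0)

def score_labelling(obs, frag_ids, labels, mu, sigma):
    """Sum per-channel log scores for one speech/background labelling.
    frag_ids[i] is the fragment that channel i belongs to;
    labels[f] is True if fragment f is hypothesised to be speech."""
    total = 0.0
    for x, f, m, s in zip(obs, frag_ids, mu, sigma):
        total += log_gauss(x, m, s) if labels[f] else log_cdf(x, m, s)
    return total

def best_labelling(obs, frag_ids, mu, sigma):
    """Exhaustively search all 2^F labellings (toy sizes only)."""
    n_frags = max(frag_ids) + 1
    return max(
        product([False, True], repeat=n_frags),
        key=lambda labels: score_labelling(obs, frag_ids, labels, mu, sigma),
    )

# Toy frame: fragment 0 matches the speech model, fragment 1 is far louder.
obs = [10.0, 10.0, 20.0, 20.0]
frag_ids = [0, 0, 1, 1]
mu = [10.0, 10.0, 10.0, 10.0]
sigma = [0.5, 0.5, 0.5, 0.5]
labels = best_labelling(obs, frag_ids, mu, sigma)
```

In this toy frame, fragment 0 is labelled speech and fragment 1 is labelled background, because energy far above the speech model is better explained as a masking source. Labelling whole fragments rather than individual channels is what keeps the search space at 2^F instead of 2^(number of channels).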
