A model for multitalker speech perception.

A listener's ability to understand a target speaker in the presence of one or more simultaneous competing speakers is subject to two types of masking: energetic and informational. Energetic masking occurs when the target and interfering signals overlap in time and frequency, rendering portions of the target inaudible. Informational masking occurs when the listener cannot distinguish the target from the interference even though both are audible. A computational model of multitalker speech perception is presented that accounts for both types of masking. Human perception under energetic masking is modeled using a speech recognizer that treats the masked time-frequency units of the target as missing data. The effects of informational masking are modeled as errors in target segregation made by a speech separation system. In a systematic evaluation, the performance of the proposed model is in broad agreement with the results of a recent perceptual study.
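The two ingredients the abstract names can be illustrated with a minimal sketch: a binary time-frequency mask computed from the premixing target and masker energies via a local SNR criterion, and a diagonal-Gaussian log-likelihood that marginalizes (ignores) the masked units, in the spirit of missing-data speech recognition. The function names, the SNR criterion, and the diagonal-Gaussian form are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def ideal_binary_mask(target_energy, masker_energy, lc_db=0.0):
    """Label a time-frequency unit reliable (1) when the target's local
    SNR exceeds the criterion lc_db, otherwise masked (0).

    Assumes access to the premixing target and masker energies, which is
    what makes the mask 'ideal' rather than estimated."""
    snr_db = 10.0 * np.log10(target_energy / masker_energy)
    return (snr_db > lc_db).astype(int)

def missing_data_loglik(obs, mask, mean, var):
    """Diagonal-Gaussian log-likelihood of an observed feature vector,
    marginalizing over the features flagged as masked (mask == 0)."""
    reliable = mask.astype(bool)
    d = obs[reliable] - mean[reliable]
    v = var[reliable]
    return -0.5 * np.sum(np.log(2.0 * np.pi * v) + d * d / v)
```

Scoring only the reliable units means a feature corrupted by a competing talker cannot drag down the likelihood of the correct word model, which is the intuition behind modeling energetic masking as missing data.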
