An Auditory Scene Analysis Approach to Monaural Speech Segregation

A human listener has the remarkable ability to segregate an acoustic mixture and attend to a target sound. This perceptual process is called auditory scene analysis (ASA). Moreover, the listener can accomplish much of this analysis with only one ear. Research in ASA has inspired many studies in computational auditory scene analysis (CASA) for sound segregation. In this chapter, we introduce a CASA approach to monaural speech segregation. After a brief overview of CASA, we present in detail a CASA system that segregates both voiced and unvoiced speech. Our description covers the major stages of CASA, including feature extraction, auditory segmentation, and grouping.
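The three stages named above can be sketched in code. The following Python sketch is a minimal illustration under stated assumptions, not the chapter's actual system: it substitutes a plain STFT magnitude map for a gammatone-based cochleagram in the feature-extraction stage, a global energy threshold for onset/offset or cross-channel correlation analysis in the segmentation stage, and a trivial rule for pitch-based grouping. The function names extract_features, segment, and group are hypothetical placeholders.

import numpy as np

def extract_features(mixture, sr, n_fft=512, hop=160):
    # Stage 1 (feature extraction): STFT magnitudes stand in for the
    # cochleagram/correlogram features a real CASA front end would compute.
    frames = [mixture[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(mixture) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T   # shape (freq, time)

def segment(tf_map):
    # Stage 2 (auditory segmentation): merge time-frequency units into
    # contiguous regions; a crude energy threshold stands in for
    # onset/offset detection or cross-channel correlation analysis.
    return tf_map > tf_map.mean()

def group(segments, tf_map):
    # Stage 3 (grouping): assign segments to the target stream. A real
    # system would use pitch and amplitude-modulation cues from tf_map for
    # voiced speech; here every segment is kept, yielding a binary mask.
    return segments.astype(float)

if __name__ == "__main__":
    mixture = np.random.randn(16000)          # 1 s of noise as a stand-in mixture
    tf = extract_features(mixture, sr=16000)
    mask = group(segment(tf), tf)             # estimated binary time-frequency mask
    print(mask.shape)

The output of such a pipeline is typically a binary time-frequency mask that selects the units dominated by the target speech; the mask can then be applied to the mixture to resynthesize the segregated signal.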
