A Computational Auditory Scene Analysis System for Robust Speech Recognition

We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. The system estimates, in two stages, the ideal binary time-frequency (T-F) mask, which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within that unit. In the first stage, harmonicity is used to segregate the voiced portions of individual sources in each time frame, based on multipitch tracking; unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting T-F masks are used in conjunction with missing-data methods for recognition. Systematic evaluations on a speech separation challenge task show significant improvement over the baseline performance.
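The ideal binary mask defined above can be illustrated with a minimal sketch. It assumes the target and interference energies are known separately for each T-F unit, which holds only when the premixed signals are available; the system described here has to estimate the mask from the mixture alone. The function names, the `[time][frequency]` list layout, and the toy energy values are all hypothetical, and the mixture energy is approximated as the sum of the two source energies.

```python
def ideal_binary_mask(target_energy, interference_energy):
    """Return a 0/1 mask with the same shape as the inputs.

    Inputs are lists of frames, each a list of per-channel energies
    (assumed layout: [time][frequency]).
    """
    if len(target_energy) != len(interference_energy):
        raise ValueError("inputs must have the same number of frames")
    mask = []
    for t_frame, i_frame in zip(target_energy, interference_energy):
        if len(t_frame) != len(i_frame):
            raise ValueError("frames must have the same number of channels")
        # Keep a unit (1) only when the target is strictly stronger.
        mask.append([1 if t > i else 0 for t, i in zip(t_frame, i_frame)])
    return mask


def apply_mask(mixture_energy, mask):
    """Retain mixture energy where the mask is 1 and zero it elsewhere."""
    return [[m * k for m, k in zip(m_frame, k_frame)]
            for m_frame, k_frame in zip(mixture_energy, mask)]


# Toy data: 2 time frames x 3 frequency channels (made-up values).
target = [[4.0, 1.0, 0.5], [2.0, 3.0, 0.1]]
interference = [[1.0, 2.0, 0.5], [0.5, 1.0, 0.4]]

# Assumption: energies add, i.e. the sources are uncorrelated within a unit.
mixture = [[t + i for t, i in zip(tf, nf)]
           for tf, nf in zip(target, interference)]

mask = ideal_binary_mask(target, interference)
masked = apply_mask(mixture, mask)
```

Units where the two energies are equal are discarded in this sketch, following the "target is stronger" wording of the definition.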
