Segregation of unvoiced speech from nonspeech interference.

Monaural speech segregation has proven extremely challenging. While efforts in computational auditory scene analysis have led to considerable progress in voiced speech segregation, little attention has been given to unvoiced speech, which lacks harmonic structure, has weaker energy, and is hence more susceptible to interference. This study proposes a new approach to segregating unvoiced speech from nonspeech interference. It first addresses the question of how much speech is unvoiced. The segregation process then occurs in two stages: segmentation and grouping. In segmentation, the proposed model decomposes an input mixture into contiguous time-frequency segments by a multiscale analysis of event onsets and offsets. Grouping of unvoiced segments is based on Bayesian classification of acoustic-phonetic features. The proposed model for unvoiced speech segregation joins an existing model for voiced speech segregation to produce an overall system that can handle both voiced and unvoiced speech. Systematic evaluation shows that the proposed system extracts a majority of unvoiced speech without including much interference, and that it performs substantially better than spectral subtraction.
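The two-stage process described above (segmentation by onset/offset analysis, then grouping of segments) can be sketched in miniature. The code below is a much-simplified, hypothetical illustration, not the paper's method: energy-derivative thresholding at a few smoothing scales stands in for the full multiscale onset/offset analysis, and nearest-class-mean labeling stands in for Bayesian classification of acoustic-phonetic features. All function names, parameters, and thresholds are assumptions introduced for illustration.

```python
import numpy as np

def segment_onsets_offsets(energy, scales=(1, 2, 4), threshold=0.5):
    """Toy segmentation: at several smoothing scales, mark frames where
    the frame-energy derivative rises above `threshold` (onset
    candidates) or falls below `-threshold` (offset candidates)."""
    boundaries = set()
    for s in scales:
        # Box smoother of width s approximates analysis at a coarser scale.
        smooth = np.convolve(energy, np.ones(s) / s, mode="same")
        d = np.diff(smooth)
        boundaries.update(np.flatnonzero(d > threshold) + 1)   # onsets
        boundaries.update(np.flatnonzero(d < -threshold) + 1)  # offsets
    return sorted(int(b) for b in boundaries)

def group_segments(segment_features, speech_mean, noise_mean):
    """Toy grouping: assign each segment to whichever class mean its
    feature vector lies closer to -- a crude stand-in for a Bayesian
    classifier over acoustic-phonetic features."""
    labels = []
    for f in segment_features:
        d_speech = np.linalg.norm(f - speech_mean)
        d_noise = np.linalg.norm(f - noise_mean)
        labels.append("speech" if d_speech < d_noise else "noise")
    return labels

# Example: a single energy burst yields one onset and one offset boundary.
seg = segment_onsets_offsets(np.array([0., 0, 0, 1, 1, 1, 0, 0, 0]),
                             scales=(1,))
labels = group_segments([np.array([1., 1.]), np.array([0., 0.])],
                        speech_mean=np.array([1., 1.]),
                        noise_mean=np.array([0., 0.]))
```

In the real system the grouping stage is probabilistic and the features are acoustic-phonetic; the point of the sketch is only the division of labor between the two stages.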
