Pitch-based monaural segregation of reverberant speech.

In everyday listening, both background noise and reverberation degrade the speech signal. Psychoacoustic evidence suggests that human speech perception under reverberant conditions relies mostly on monaural processing. While speech segregation based on periodicity has achieved considerable progress in handling additive noise, little monaural segregation research has been devoted to reverberant scenarios. Reverberation smears the harmonic structure of speech signals, and our evaluations using a pitch-based segregation algorithm show that an increase in room reverberation time degrades performance because periodicity in the target signal is weakened. We propose a two-stage monaural separation system that combines inverse filtering of the room impulse response corresponding to the target location with a pitch-based speech segregation method. In the first stage, the harmonicity of a signal arriving from the target direction is partially restored, while signals arriving from other directions are further smeared, which leads to improved segregation. A systematic evaluation shows that the proposed system achieves considerable signal-to-noise ratio gains across different conditions. Potential applications of this system include robust automatic speech recognition and hearing aid design.
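The two stages described above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: stage one builds a regularized least-squares inverse of a known room impulse response (the paper assumes the impulse response for the target location), and stage two forms a binary frequency mask that keeps bins near harmonics of an autocorrelation-based pitch estimate. The function names, the regularization constant, and the harmonic tolerance are all illustrative choices.

```python
import numpy as np

def inverse_filter(h, length=512, reg=1e-3):
    """Regularized spectral inverse of a room impulse response h.

    A simple stand-in for the paper's inverse-filtering stage:
    G = conj(H) / (|H|^2 + reg) avoids division by near-zero bins
    of a non-minimum-phase response. `length` and `reg` are
    illustrative parameters.
    """
    H = np.fft.rfft(h, length)
    G = np.conj(H) / (np.abs(H) ** 2 + reg)
    return np.fft.irfft(G, length)

def harmonic_mask(frame, fs, fmin=80.0, fmax=400.0, tol=0.04):
    """Binary mask over rfft bins of `frame`, keeping bins close to
    integer multiples of an autocorrelation pitch estimate.

    This is a toy version of pitch-based segregation: real systems
    (e.g., Hu and Wang's) operate on a gammatone filterbank with
    per-channel correlograms rather than a single FFT frame.
    """
    # Autocorrelation pitch estimate over a plausible speech F0 range.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    f0 = fs / lag
    # Keep bins whose frequency is near a harmonic k * f0.
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    ratio = freqs / f0
    return np.abs(ratio - np.round(ratio)) < tol * np.maximum(ratio, 1.0)

# Usage sketch: dereverberate a signal convolved with a two-tap
# "room", then mask a voiced frame around its harmonics.
rng = np.random.default_rng(0)
h = np.zeros(100)
h[0], h[50] = 1.0, 0.6          # direct path plus a single echo
x = rng.standard_normal(2000)    # stand-in for a speech signal
y = np.convolve(x, h)            # reverberant observation
z = np.convolve(y, inverse_filter(h))[:len(x)]  # dereverberated
```

Dereverberating before segregation is exactly the motivation in the abstract: the inverse filter restores the periodicity that the pitch-based stage depends on, while interference from other directions, filtered by a mismatched inverse, becomes less periodic.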
