Binaural Speech Separation Using Recurrent Timing Neural Networks for Joint F0-Localisation Estimation

A novel extension to recurrent timing neural networks (RTNNs) is proposed that allows such networks to exploit a joint interaural time difference-fundamental frequency (ITD-F0) auditory cue rather than F0 alone. The extension couples a second layer of coincidence detectors to a two-dimensional RTNN. Each coincidence detector is tuned to a particular ITD and feeds excitation to one column of the RTNN, so that one axis of the network represents F0 and the other ITD. The resulting behaviour allows sources to be segregated on the basis of their separation in ITD-F0 space. Furthermore, all grouping and segregation activity proceeds within individual frequency channels, without recourse to the across-channel estimates of F0 or ITD that are commonly used in auditory scene analysis approaches. The system is evaluated on a source separation task operating on spatialised speech signals.
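To make the architecture concrete, the following minimal Python sketch illustrates the two-layer idea described in the abstract: a bank of Jeffress-style ITD coincidence detectors feeding the columns of a two-dimensional grid of recurrent timing units whose row delays correspond to candidate F0 periods. The sketch operates on a single frequency channel, and all names and parameter values (itd_rtnn, n_itd, alpha, and so on) are illustrative assumptions, not the authors' implementation.

import numpy as np

def itd_rtnn(left, right, fs, max_itd_s=8e-4, n_itd=16,
             min_f0=80.0, max_f0=400.0, n_f0=32, alpha=0.8):
    """Joint ITD-F0 recurrent timing net for one frequency channel.
    Axis 0 of the returned grid is F0 (recurrent loop delay),
    axis 1 is ITD (interaural lag). Illustrative sketch only."""
    n = len(left)
    itd_lags = np.linspace(-max_itd_s, max_itd_s, n_itd)
    itd_samps = np.round(itd_lags * fs).astype(int)
    f0_periods = np.round(fs / np.linspace(max_f0, min_f0, n_f0)).astype(int)

    # Layer 1: Jeffress-style coincidence detectors, one per candidate ITD.
    # Each multiplies the left signal with a lagged copy of the right signal
    # (half-wave rectified); its output excites one column of the RTNN.
    excitation = np.zeros((n_itd, n))
    for j, lag in enumerate(itd_samps):
        excitation[j] = np.maximum(left * np.roll(right, lag), 0.0)

    # Layer 2: two-dimensional RTNN. Unit (i, j) owns a recurrent delay
    # loop of f0_periods[i] samples fed by ITD column j. Coincidence of
    # the current input with the unit's own delayed output reinforces
    # activity for periodic sources whose period matches the loop delay.
    act = np.zeros((n_f0, n_itd))
    loops = [[np.zeros(p) for _ in range(n_itd)] for p in f0_periods]
    for t in range(n):
        for i, p in enumerate(f0_periods):
            for j in range(n_itd):
                delayed = loops[i][j][t % p]
                out = excitation[j, t] * (1.0 + delayed)  # coincidence gain
                loops[i][j][t % p] = alpha * out          # recirculate with decay
                act[i, j] += out
    return act, f0_periods / fs, itd_lags

Under these assumptions, running the net on a mixture of two spatialised periodic sources should yield two distinct peaks in the returned F0-by-ITD activation grid, which is the basis on which the sources are segregated; note that all of this computation takes place within a single frequency channel, consistent with the within-channel processing described above.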
