Recurrent Timing Neural Networks for Joint F0-Localisation Based Speech Separation

A novel extension to recurrent timing neural networks (RTNNs) is proposed which allows such networks to exploit a joint interaural time difference-fundamental frequency (ITD-F0) auditory cue rather than F0 alone. The extension couples a second layer of coincidence detectors to a two-dimensional RTNN. Each coincidence detector is tuned to a particular ITD and feeds excitation to one column of the RTNN, so that one axis of the RTNN represents F0 and the other ITD. The resulting behaviour allows sources to be segregated on the basis of their separation in ITD-F0 space. Furthermore, all grouping and segregation activity proceeds within individual frequency channels, without recourse to the across-channel estimates of F0 or ITD that are commonly used in auditory scene analysis approaches. The system has been evaluated using a source separation task operating on spatialised speech signals.
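
To make the two-layer architecture concrete, below is a minimal Python sketch of the scheme just described: a first layer of Jeffress-style coincidence detectors, each tuned to a candidate ITD, excites the columns of a two-dimensional grid of recurrent delay loops whose rows correspond to candidate F0 periods, so that a periodic source builds up activity at its own cell in F0-ITD space. All function names, parameters, and the specific delay-loop update rule (multiplicative facilitation of input that recurs at a loop's characteristic period) are illustrative assumptions rather than details taken from the paper, and the sketch operates on a single frequency channel, consistent with the within-channel processing described above.

```python
import numpy as np

# Illustrative sketch only: names, parameters, and the exact delay-loop
# update rule are assumptions, not details taken from the paper.

def itd_coincidence_layer(left, right, itd_lags):
    """First layer: Jeffress-style coincidence detectors, one per candidate
    ITD lag (positive lag = right-ear signal lags the left). Each detector
    responds when left(t) coincides with right(t + lag)."""
    n = len(left)
    responses = np.zeros((len(itd_lags), n))
    for i, lag in enumerate(itd_lags):
        if lag >= 0:
            responses[i, :n - lag] = left[:n - lag] * right[lag:]
        else:
            responses[i, -lag:] = left[-lag:] * right[:n + lag]
    return responses  # shape: (n_itd_lags, n_samples)

def rtnn_2d(coincidence, loop_periods, gain=0.9):
    """Second layer: a two-dimensional recurrent timing network. Columns
    index ITD (each fed by one coincidence detector); rows index delay-loop
    period, i.e. candidate F0. Input that recurs at a loop's characteristic
    period is facilitated by the signal circulating in that loop, so a
    periodic source builds up activity at its own (F0, ITD) cell."""
    n_itd, n_samples = coincidence.shape
    activity = np.zeros((len(loop_periods), n_itd))
    # one circulating buffer per row: loops[r][k, j] holds the signal in
    # the loop of period loop_periods[r], ITD column k, loop phase j
    loops = [np.zeros((n_itd, p)) for p in loop_periods]
    for t in range(n_samples):
        x = coincidence[:, t]                 # excitation to every column
        for r, p in enumerate(loop_periods):
            j = t % p
            circulating = loops[r][:, j]      # what arrived p samples ago
            out = x * (1.0 + circulating)     # coincident input facilitated
            loops[r][:, j] = gain * out       # recirculate with decay
            activity[r] += out
    return activity  # accumulated response over F0-ITD space

if __name__ == "__main__":
    # Demo: one periodic source (100 Hz pulse train) with an ITD of
    # 8 samples (0.5 ms at 16 kHz); the right-ear signal lags the left.
    fs = 16000
    t = np.arange(fs) / fs
    left = (np.sin(2 * np.pi * 100 * t) > 0.99).astype(float)
    right = np.roll(left, 8)
    lags = list(range(-16, 17))
    resp = itd_coincidence_layer(left, right, lags)
    periods = [fs // f0 for f0 in range(80, 201, 10)]  # includes 160 (100 Hz)
    grid = rtnn_2d(resp, periods)
    r, k = np.unravel_index(np.argmax(grid), grid.shape)
    print("peak period:", periods[r], "samples; peak ITD lag:",
          lags[k], "samples")
```

Running the demo yields a single peak at the cell matching the source's pitch period (160 samples) and interaural lag (8 samples); with two spatialised talkers, each would build up its own peak, and it is this separation in ITD-F0 space that forms the basis for segregation.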
