Binaural Localization of Multiple Sound Sources by Non-Negative Tensor Factorization

This paper presents a non-negative factorization of audio signals for the binaural localization of multiple sound sources in realistic and unknown sound environments. Non-negative tensor factorization (NTF) provides a sparse representation of multichannel audio signals in time, frequency, and space that can be exploited in computational audio scene analysis and robot audition for the separation and localization of sound sources. In the proposed formulation, each sound source is represented by its spectral dictionaries, its temporal activations, and its distribution across channels (here, the left and right ears). Because this distribution depends on frequency, it can be interpreted as an explicit estimate of the Head-Related Transfer Function (HRTF) of a binaural head, which can then be converted into an estimated sound source position. Moreover, the semi-supervised formulation of the non-negative factorization makes it possible to integrate prior knowledge about the sound sources of interest, whose dictionaries can be learned in advance, whereas the remaining sources are treated as background sound, which remains unknown and is estimated on the fly. The proposed NTF-based method is applied here to the binaural localization of multiple speakers within realistic sound environments.
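As an illustration of the kind of model described above, the following is a minimal sketch, not the authors' implementation. It assumes a two-channel magnitude spectrogram tensor `V[c, f, t]`, a Euclidean cost with standard multiplicative updates, and one frequency-dependent channel distribution `Q[c, f, k]` per component. The function name `ntf`, the symbols `Q`, `W`, `H`, the choice of cost, and the synthetic data are all assumptions made for this example; the semi-supervised variant with pre-learned dictionaries and the mapping from the estimated distribution to a source position are not shown.

```python
import numpy as np

def ntf(V, n_components, n_iter=100, seed=0, eps=1e-12):
    """Approximate V[c, f, t] by sum_k Q[c, f, k] * W[f, k] * H[k, t].

    Q: channel distribution per frequency and component (sums to 1 over channels)
    W: spectral dictionary, H: temporal activations.
    Euclidean cost with multiplicative updates (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    C, F, T = V.shape
    Q = rng.random((C, F, n_components)) + eps
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps

    def model():
        return np.einsum("cfk,fk,kt->cft", Q, W, H)

    losses = []
    for _ in range(n_iter):
        Vh = model()
        Q *= np.einsum("cft,fk,kt->cfk", V, W, H) / (
            np.einsum("cft,fk,kt->cfk", Vh, W, H) + eps)
        Vh = model()
        W *= np.einsum("cft,cfk,kt->fk", V, Q, H) / (
            np.einsum("cft,cfk,kt->fk", Vh, Q, H) + eps)
        Vh = model()
        H *= np.einsum("cft,cfk,fk->kt", V, Q, W) / (
            np.einsum("cft,cfk,fk->kt", Vh, Q, W) + eps)
        # Remove the scale ambiguity between Q and W: normalise Q over
        # channels and move the scale into W (the product Q * W is unchanged).
        scale = Q.sum(axis=0) + eps
        Q /= scale
        W *= scale
        losses.append(float(np.sum((V - model()) ** 2)))
    return Q, W, H, losses

# Synthetic two-channel example (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
C, F, T, K = 2, 16, 40, 3
Q_true = rng.random((C, F, K))
Q_true /= Q_true.sum(axis=0)
V = np.einsum("cfk,fk,kt->cft", Q_true, rng.random((F, K)), rng.random((K, T)))

Q, W, H, losses = ntf(V, n_components=K, n_iter=200)

# Frequency-dependent level difference between the two channels, in dB,
# assuming V holds magnitudes (hence the factor 20).
ild_db = 20.0 * np.log10((Q[0] + 1e-12) / (Q[1] + 1e-12))
```

In this sketch the ratio of the two channel gains in `Q` plays the role of a frequency-dependent interaural level cue for each component; comparing such a cue with measured HRTFs to obtain a position would be a separate step that is outside the scope of this example.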
