Discriminative multiple sound source localization based on deep neural networks using independent location model

We propose a training method for multiple sound source localization (SSL) based on deep neural networks (DNNs). Such networks function as posterior probability estimators of sound location in terms of position labels and achieve high localization correctness. Because the previous DNN configuration for SSL handles only one-sound-source cases, it must be extended to multiple-sound-source cases before it can be applied to real environments. However, a naïve design causes 1) an increase in the number of labels and training data patterns and 2) a lack of label consistency across different numbers of sound sources, such as between the one-sound case and the two-or-more-sound cases. Our proposed method solves these two problems: an independent location model addresses the former, and block-wise consistent labeling with ordering addresses the latter. Our experiments indicated that SSL based on DNNs trained with the proposed method outperformed a conventional SSL method by a maximum of 18 points in terms of block-level correctness.
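
To make the two problems concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a naïve design assigns one class label to every unordered set of source positions, and that an independent location model gives each source slot its own output over positions plus a "no source" class. The function names, the `no_source` marker, and the figure of 72 candidate positions are all hypothetical choices made for illustration.

```python
from math import comb


def joint_label_count(num_positions: int, max_sources: int) -> int:
    """Classes needed if every unordered set of 0..max_sources distinct
    positions gets its own label (assumed form of the naive design)."""
    return sum(comb(num_positions, n) for n in range(max_sources + 1))


def independent_output_count(num_positions: int, max_sources: int) -> int:
    """Output units if each source slot has its own classifier over the
    positions plus one 'no source' class (assumed independent model)."""
    return max_sources * (num_positions + 1)


def ordered_slot_labels(positions, max_sources, no_source=-1):
    """Assign active positions to slots in sorted order and pad the rest,
    so one-source and multi-source blocks share the same label layout.
    The sorting rule is an assumption standing in for 'ordering'."""
    active = sorted(set(positions))
    if len(active) > max_sources:
        raise ValueError("more active positions than source slots")
    return active + [no_source] * (max_sources - len(active))


if __name__ == "__main__":
    # Hypothetical setup: 72 candidate positions, up to 2 simultaneous sources.
    print(joint_label_count(72, 2))         # 1 + 72 + 2556 = 2629
    print(independent_output_count(72, 2))  # 2 * 73 = 146
    print(ordered_slot_labels([40, 10], 3)) # [10, 40, -1]
    print(ordered_slot_labels([10], 3))     # [10, -1, -1]
```

Under these assumptions the joint label set grows combinatorially with the number of sources, while the per-slot outputs grow only linearly, and the ordered padding keeps a block with one source labeled in the same format as a block with several.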
