Deep Ranking-Based Sound Source Localization

Sound source localization is a challenging task under strong reverberation. Recently, there has been growing interest in learning-based localization methods, in which acoustic features are extracted from the measured signals and fed to a model that maps them to the corresponding source positions. Typically, a massive dataset of labeled samples from known positions is required to train such models. Here, we present a novel weakly supervised deep-learning localization method that exploits only a few labeled (anchor) samples with known positions, together with a larger set of unlabeled samples for which only the relative physical ordering is known. We design an architecture that uses a stochastic combination of a triplet-ranking loss for the unlabeled samples and a physical loss for the anchor samples to learn a nonlinear deep embedding that maps acoustic features to the azimuth angle of the source. The combined loss can be optimized effectively using standard gradient-based approaches. Evaluating the proposed approach on simulated data, we demonstrate a significant improvement over two previous learning-based approaches across various reverberation levels, while maintaining consistent performance as the amount of labeled data varies.
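The stochastic combination of the two loss terms can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding is modeled as a scalar (since the target azimuth is one-dimensional), and the names `p_anchor`, `margin`, and `embed` are hypothetical placeholders for the paper's mixing probability, ranking margin, and deep network.

```python
import random

def triplet_ranking_loss(emb_a, emb_p, emb_n, margin=1.0):
    """Hinge loss pushing the embedding of the 'positive' sample
    (closer in the known physical ordering) nearer to the anchor
    embedding than the 'negative' sample, by at least `margin`."""
    d_ap = (emb_a - emb_p) ** 2
    d_an = (emb_a - emb_n) ** 2
    return max(0.0, margin + d_ap - d_an)

def physical_loss(pred_azimuth, true_azimuth):
    """Squared error between the predicted and known azimuth for a
    labeled (anchor) sample."""
    return (pred_azimuth - true_azimuth) ** 2

def stochastic_combined_loss(labeled, triplets, embed, p_anchor=0.2):
    """At each training step, draw either a labeled anchor (with
    probability p_anchor) or an ordered triplet of unlabeled samples,
    and return the corresponding loss term.  Averaged over many steps,
    this optimizes the weighted combination
    p_anchor * L_phys + (1 - p_anchor) * L_rank."""
    if random.random() < p_anchor:
        features, azimuth = random.choice(labeled)
        return physical_loss(embed(features), azimuth)
    a, p, n = random.choice(triplets)
    return triplet_ranking_loss(embed(a), embed(p), embed(n))
```

Because each step evaluates only one of the two terms, the expected gradient equals that of the weighted sum, so a standard gradient-based optimizer can be applied directly.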
