Deep Ranking-Based Sound Source Localization

Sound source localization is a challenging task under strong reverberation. Recently, there has been growing interest in learning-based localization methods, in which acoustic features are extracted from the measured signals and fed to a model that maps them to the corresponding source positions. Typically, a massive dataset of labeled samples from known positions is required to train such models. Here, we present a novel weakly supervised deep-learning localization method that exploits only a few labeled (anchor) samples with known positions, together with a larger set of unlabeled samples for which only the relative physical ordering is known. We design an architecture that uses a stochastic combination of a triplet-ranking loss for the unlabeled samples and a physical loss for the anchor samples to learn a nonlinear deep embedding that maps acoustic features to the azimuth angle of the source. The combined loss can be optimized effectively using standard gradient-based approaches. Evaluating the proposed approach on simulated data, we demonstrate a significant improvement over two previous learning-based approaches across various reverberation levels, while maintaining consistent performance as the amount of labeled data varies.
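The stochastic combination of the two loss terms can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding is modeled as a scalar (since the target azimuth is one-dimensional), and the names `p_anchor`, `margin`, and `embed` are hypothetical placeholders for the paper's mixing probability, ranking margin, and deep network.

```python
import random

def triplet_ranking_loss(emb_a, emb_p, emb_n, margin=1.0):
    """Hinge loss pushing the embedding of the 'positive' sample
    (closer in the known physical ordering) nearer to the anchor
    embedding than the 'negative' sample, by at least `margin`."""
    d_ap = (emb_a - emb_p) ** 2
    d_an = (emb_a - emb_n) ** 2
    return max(0.0, margin + d_ap - d_an)

def physical_loss(pred_azimuth, true_azimuth):
    """Squared error between the predicted and known azimuth for a
    labeled (anchor) sample."""
    return (pred_azimuth - true_azimuth) ** 2

def stochastic_combined_loss(labeled, triplets, embed, p_anchor=0.2):
    """At each training step, draw either a labeled anchor (with
    probability p_anchor) or an ordered triplet of unlabeled samples,
    and return the corresponding loss term.  Averaged over many steps,
    this optimizes the weighted combination
    p_anchor * L_phys + (1 - p_anchor) * L_rank."""
    if random.random() < p_anchor:
        features, azimuth = random.choice(labeled)
        return physical_loss(embed(features), azimuth)
    a, p, n = random.choice(triplets)
    return triplet_ranking_loss(embed(a), embed(p), embed(n))
```

Because each step evaluates only one of the two terms, the expected gradient equals that of the weighted sum, so a standard gradient-based optimizer can be applied directly.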
