Semi-Supervised Source Localization with Deep Generative Modeling

We propose a semi-supervised localization approach based on deep generative modeling with variational autoencoders (VAEs). Localization in reverberant environments remains a challenge, which machine learning (ML) has shown promise in addressing. Even with large data volumes, the number of labels available for supervised learning in reverberant environments is usually small. We address this issue by performing semi-supervised learning (SSL) with convolutional VAEs. The VAE is trained to generate the phase of relative transfer functions (RTFs), in parallel with a direction-of-arrival (DOA) classifier, on both labeled and unlabeled RTF samples. The VAE-SSL approach is compared with the steered response power with phase transform (SRP-PHAT) method and fully supervised convolutional neural networks (CNNs). We find that VAE-SSL can outperform both SRP-PHAT and the CNN in label-limited scenarios.
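The RTF phase used as the VAE input can be illustrated with a minimal sketch. The abstract does not say how the RTFs are estimated, so the estimator below (frame-averaged cross-spectrum divided by the reference-channel power spectrum) is an assumption, and the names `stft` and `rtf_phase`, the frame length, and the toy signals are hypothetical choices made only for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def rtf_phase(ref, mic, n_fft=256, hop=128, eps=1e-12):
    """Phase of a simple RTF estimate between a reference and a second microphone.

    Assumed estimator (not taken from the paper): cross-PSD(mic, ref) / PSD(ref),
    with both spectra averaged over STFT frames.
    """
    X_ref = stft(ref, n_fft, hop)
    X_mic = stft(mic, n_fft, hop)
    cross_psd = np.mean(X_mic * np.conj(X_ref), axis=0)
    ref_psd = np.mean(np.abs(X_ref) ** 2, axis=0)
    rtf = cross_psd / (ref_psd + eps)
    return np.angle(rtf)  # wrapped to (-pi, pi], one value per frequency bin

# Toy example: the second channel is the reference delayed by 2 samples
# (anechoic and noise-free, unlike the reverberant setting in the abstract).
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
delay = 2
mic = np.roll(ref, delay)  # circular shift stands in for a pure delay
phase = rtf_phase(ref, mic)
```

For a pure delay of `delay` samples, the phase at bin `k` is close to `-2 * pi * k * delay / n_fft`, which is the kind of DOA-dependent structure a classifier can pick up. In a reverberant room the phase departs from this linear pattern, which is where a learned model becomes useful.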
