Ambisonic Signal Processing DNNs Guaranteeing Rotation, Scale and Time Translation Equivariance

We propose a novel framework for designing Ambisonic signal processing deep neural networks (DNNs) that guarantee physical symmetries. In general, spatial acoustic signal processing DNNs for tasks such as sound event detection should perform with the same accuracy regardless of the directions of arrival of the sound sources. This property is known as rotation symmetry in the natural sciences. However, most conventional multichannel signal processing DNNs do not incorporate rotation symmetry explicitly into the model structure; instead, they acquire an approximate rotation symmetry by training on large datasets of signals arriving from various directions. Consequently, conventional methods do not perform adequately when the training dataset is relatively small or statistically biased, e.g., when the distribution of the arrival directions of the sound events is inhomogeneous. Furthermore, to handle acoustic signals efficiently in DNNs, several additional symmetries must be considered, such as amplitude scaling and time translation of the signals. In this paper, we formulate these symmetry assumptions, which are called equivariance, in a unified way as constraints on the design of the target DNN. We propose a new DNN design method called Clebsch-Gordan Nets with Scale and Time translation Symmetry (CGNets-STS), which is guaranteed to simultaneously satisfy three types of equivariance (3D rotation, amplitude scaling, and time translation). As an instance of this method, we design a DNN model for sound event localization and detection from Ambisonic signals. Experimental results show that this model is highly robust to spatial rotations of the input data.
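To make the three equivariance properties concrete, the following is a minimal numerical sketch in Python with NumPy. It is not the CGNets-STS architecture; it is a hypothetical toy layer built only to illustrate what the constraints mean. It assumes a first-order Ambisonic signal split into an omnidirectional channel `w` and three first-order channels `v` that are treated as a Cartesian 3-vector (real channel orderings such as ACN permute these components), a positive amplitude scale, and circular time shifts so that the checks hold exactly. The names `circular_filter`, `toy_equivariant_layer`, and `random_rotation` are illustrative and do not come from the paper.

```python
import numpy as np

def circular_filter(x, kernel):
    """Circular temporal filter along axis 0, shared by all channels."""
    return sum(k * np.roll(x, i, axis=0) for i, k in enumerate(kernel))

def toy_equivariant_layer(w, v, kernel):
    """Hypothetical toy layer (not CGNets-STS).

    w: (T,) omnidirectional channel, unchanged by rotations.
    v: (T, 3) first-order channels, assumed to rotate as a 3-vector.
    Returns a rotation-invariant output and a rotating vector output.
    """
    wf = circular_filter(w, kernel)      # linear in the signal, commutes with time shifts
    vf = circular_filter(v, kernel)      # same filter on every channel, so it commutes with R
    vnorm = np.linalg.norm(vf, axis=1)   # unchanged by rotations
    norm = np.sqrt(wf**2 + vnorm**2)     # scales linearly with a positive amplitude gain
    # Gate is a ratio of norms: unchanged by rotations and by positive scaling.
    gate = np.divide(vnorm, norm, out=np.zeros_like(norm), where=norm > 0)
    return norm, gate[:, None] * vf

def random_rotation(rng):
    """Random proper rotation matrix (determinant +1) from a QR factorization."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

rng = np.random.default_rng(0)
T = 64
w = rng.normal(size=T)
v = rng.normal(size=(T, 3))
kernel = rng.normal(size=5)
R = random_rotation(rng)
out_w, out_v = toy_equivariant_layer(w, v, kernel)

# Rotation: rotating the input rotates the vector output and leaves the scalar output unchanged.
rot_w, rot_v = toy_equivariant_layer(w, v @ R.T, kernel)
rotation_ok = np.allclose(rot_w, out_w) and np.allclose(rot_v, out_v @ R.T)

# Amplitude scaling: a positive gain on the input appears as the same gain on the output.
a = 2.5
sc_w, sc_v = toy_equivariant_layer(a * w, a * v, kernel)
scale_ok = np.allclose(sc_w, a * out_w) and np.allclose(sc_v, a * out_v)

# Time translation: a (circular) shift of the input shifts the output by the same amount.
s = 7
sh_w, sh_v = toy_equivariant_layer(np.roll(w, s), np.roll(v, s, axis=0), kernel)
shift_ok = np.allclose(sh_w, np.roll(out_w, s)) and np.allclose(sh_v, np.roll(out_v, s, axis=0))

print(rotation_ok, scale_ok, shift_ok)
```

In this sketch the three flags are expected to hold up to floating-point tolerance because the properties follow from how the layer is constructed rather than from training, which is the distinction the abstract draws between guaranteed symmetry and symmetry learned from data.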
