Meta-SELD: Meta-Learning for Fast Adaptation to the New Environment in Sound Event Localization and Detection

For learning-based sound event localization and detection (SELD) methods, a mismatch between the acoustic environments of the training and test sets can lead to large performance differences between the validation and evaluation stages. Differences between environments, such as room size, reverberation time, and background noise, can cause a learning-based system to fail. On the other hand, acquiring annotated spatial sound event samples, which include onset and offset time stamps, sound event class types, and the direction of arrival (DOA) of sound sources, is very expensive. In addition, deploying a SELD system in a new environment often poses challenges due to time-consuming training and fine-tuning processes. To address these issues, we propose Meta-SELD, which applies meta-learning to achieve fast adaptation to new environments. More specifically, based on Model-Agnostic Meta-Learning (MAML), the proposed Meta-SELD aims to find good meta-initialized parameters that adapt to a new environment with only a small number of samples and parameter update iterations, so that the meta-trained SELD model can be quickly adapted to unseen environments. Our experiments compare fine-tuning of pre-trained SELD models with Meta-SELD on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset. The evaluation results demonstrate the effectiveness of Meta-SELD when adapting to new environments.
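As a rough illustration of the MAML procedure that Meta-SELD builds on, the sketch below shows one meta-update over a batch of per-environment tasks, each split into a support set (used for the inner adaptation steps) and a query set (used for the outer meta-objective). It assumes PyTorch 2.0 or later; the model, loss function, and task tensors (seld_model, seld_loss, x_support, etc.) are hypothetical placeholders, not the authors' released implementation.

import torch
from torch.func import functional_call

def maml_step(model, loss_fn, tasks, inner_lr=0.01, inner_steps=1):
    """One MAML meta-update over a batch of environment tasks.

    Each task is a tuple (x_support, y_support, x_query, y_query),
    e.g. spectrogram features and SELD targets from one environment.
    """
    params = dict(model.named_parameters())
    meta_loss = 0.0
    for x_s, y_s, x_q, y_q in tasks:
        adapted = dict(params)
        # Inner loop: adapt the shared initialization to this environment
        # with a few gradient steps on the support set.
        for _ in range(inner_steps):
            inner_loss = loss_fn(functional_call(model, adapted, (x_s,)), y_s)
            grads = torch.autograd.grad(
                inner_loss, list(adapted.values()), create_graph=True)
            adapted = {name: p - inner_lr * g
                       for (name, p), g in zip(adapted.items(), grads)}
        # Outer objective: evaluate the adapted parameters on the query set.
        meta_loss = meta_loss + loss_fn(
            functional_call(model, adapted, (x_q,)), y_q)
    return meta_loss / len(tasks)

# Hypothetical usage with a meta-optimizer over the shared initialization:
#   meta_opt.zero_grad()
#   maml_step(seld_model, seld_loss, task_batch).backward()
#   meta_opt.step()

Because create_graph=True is set in the inner loop, the outer backward pass differentiates through the adaptation steps, which is the second-order MAML objective; at test time, adaptation to an unseen environment reuses only the inner loop on a handful of samples.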
