Low-Complexity Acoustic Scene Classification for Multi-Device Audio: Analysis of DCASE 2021 Challenge Systems

This paper presents the details of Task 1A, Low-Complexity Acoustic Scene Classification with Multiple Devices, in the DCASE 2021 Challenge. The task targeted the development of low-complexity solutions with good generalization properties. The provided baseline system is based on a CNN architecture and post-training quantization of parameters. It is trained on all the available training data, without any specific technique for handling device mismatch, and obtains an overall accuracy of 47.7% with a log loss of 1.473. The task received 99 submissions from 30 teams, and most of the submitted systems outperformed the baseline. The most frequently used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70% accuracy and a log loss under 0.8. Acoustic scene classification remained a popular task in the challenge, despite the increasing difficulty of the setup.
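The two metrics quoted above, accuracy and log loss, can be illustrated with a minimal sketch. It assumes the standard definition of multiclass log loss (mean negative log probability assigned to the true class); the function names, the clipping constant `eps`, and the toy labels and probabilities are hypothetical and are not taken from the challenge evaluation code.

```python
import math

def accuracy(labels, probs):
    """Fraction of items whose highest-probability class equals the true label."""
    correct = 0
    for y, p in zip(labels, probs):
        predicted = max(range(len(p)), key=p.__getitem__)  # argmax over classes
        if predicted == y:
            correct += 1
    return correct / len(labels)

def log_loss(labels, probs, eps=1e-15):
    """Mean negative natural-log probability assigned to the true class."""
    total = 0.0
    for y, p in zip(labels, probs):
        clipped = min(max(p[y], eps), 1.0)  # avoid log(0)
        total += -math.log(clipped)
    return total / len(labels)

# Toy data (hypothetical): three items, three classes.
labels = [0, 1, 2]
probs = [
    [0.7, 0.2, 0.1],  # correct and confident
    [0.1, 0.6, 0.3],  # correct
    [0.5, 0.3, 0.2],  # wrong: true class 2 receives only 0.2
]

acc = accuracy(labels, probs)
ll = log_loss(labels, probs)
print(f"accuracy = {acc:.3f}, log loss = {ll:.3f}")
```

Unlike accuracy, log loss penalizes confident wrong predictions, so two systems with the same accuracy can differ in log loss. Under this definition, a uniform prediction over K classes gives a log loss of log K.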
