CNN-based Discriminative Training for Domain Compensation in Acoustic Event Detection with Frame-wise Classifier

Domain mismatch is a significant issue in acoustic event detection, because target-domain data is difficult to obtain in most real applications. In this study, we propose a novel CNN-based discriminative training framework as a domain compensation method to address this issue. It uses a parallel CNN-based discriminator to learn a pair of high-level intermediate acoustic representations. Together with a binary discriminative loss, the discriminators are forced to maximally exploit the heterogeneous acoustic information in each audio clip that contains target events. This yields robust paired representations that separately discriminate the target events and the background/domain variations. Moreover, to better capture the transient characteristics of target events, a frame-wise classifier is designed to perform the final classification. In addition, a two-stage training scheme with CNN-based discriminator initialization is proposed to improve system training. All experiments are performed on the DCASE 2018 Task 3 datasets. Results show that our proposal significantly outperforms the official baseline under cross-domain conditions, with relative AUC improvements of 1.8% to 12.1%, without any performance degradation under in-domain evaluation conditions.
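As a reading aid for the reported figures, the sketch below shows how a relative AUC improvement is commonly computed. It assumes "relative" means the absolute gain divided by the baseline AUC, which the abstract does not spell out. The function name and the numbers are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_auc: float, system_auc: float) -> float:
    """Return the relative AUC gain of a system over a baseline, in percent.

    Assumption: "relative" is read as (system - baseline) / baseline.
    """
    if baseline_auc <= 0:
        raise ValueError("baseline AUC must be positive")
    return 100.0 * (system_auc - baseline_auc) / baseline_auc


# Hypothetical values for illustration only; not taken from the paper.
example = relative_improvement(baseline_auc=0.80, system_auc=0.84)
print(f"relative improvement: {example:.1f}%")
```

Under this reading, a system that raises AUC from 0.80 to 0.84 gains 4 points in absolute terms, which corresponds to a relative improvement of about 5%.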
