Audio Tagging by Cross Filtering Noisy Labels

High quality labeled datasets have allowed deep learning to achieve impressive results on many sound analysis tasks. Yet, it is labor-intensive to accurately annotate large amount of audio data, and the dataset may contain noisy labels in the practical settings. Meanwhile, the deep neural networks are susceptive to those incorrect labeled data because of their outstanding memorization ability. In this article, we present a novel framework, named CrossFilter, to combat the noisy labels problem for audio tagging. Multiple representations (such as, Logmel and MFCC) are used as the input of our framework for providing more complementary information of the audio. Then, though the cooperation and interaction of two neural networks, we divide the dataset into curated and noisy subsets by incrementally pick out the possibly correctly labeled data from the noisy data. Moreover, our approach leverages the multi-task learning on curated and noisy subsets with different loss function to fully utilize the entire dataset. The noisy-robust loss function is employed to alleviate the adverse effects of incorrect labels. On both the audio tagging datasets FSDKaggle2018 and FSDKaggle2019, empirical results demonstrate the performance improvement compared with other competing approaches. On FSDKaggle2018 dataset, our method achieves state-of-the-art performance and even surpasses the ensemble models.

[1]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[2]  Shai Shalev-Shwartz,et al.  Decoupling "when to update" from "how to update" , 2017, NIPS.

[3]  Roger Zimmermann,et al.  Multi-Level Fusion based Class-aware Attention Model for Weakly Labeled Audio Tagging , 2019, ACM Multimedia.

[4]  Moncef Gabbouj,et al.  A generic audio classification and segmentation approach for multimedia indexing and retrieval , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Bhiksha Raj,et al.  Learning Sound Events From Webly Labeled Data , 2018, IJCAI.

[6]  Dumitru Erhan,et al.  Training Deep Neural Networks on Noisy Labels with Bootstrapping , 2014, ICLR.

[7]  Shao-Hu Peng,et al.  Acoustic Scene Classification Using Deep Convolutional Neural Network and Multiple Spectrograms Fusion , 2017, DCASE.

[8]  Li Fei-Fei,et al.  MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels , 2017, ICML.

[9]  Timo Aila,et al.  Temporal Ensembling for Semi-Supervised Learning , 2016, ICLR.

[10]  Nagarajan Natarajan,et al.  Learning with Noisy Labels , 2013, NIPS.

[11]  Xingrui Yu,et al.  How does Disagreement Help Generalization against Label Corruption? , 2019, ICML.

[12]  Mark D. Plumbley,et al.  General-purpose audio tagging from noisy labels using convolutional neural networks , 2018, DCASE.

[13]  Harri Valpola,et al.  Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[14]  Abhinav Gupta,et al.  Learning from Noisy Large-Scale Datasets with Minimal Supervision , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Steve Young,et al.  The HTK hidden Markov model toolkit: design and philosophy , 1993 .

[16]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Daniel P. W. Ellis,et al.  General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline , 2018, DCASE.

[18]  Archontis Politis,et al.  Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks , 2018, IEEE Journal of Selected Topics in Signal Processing.

[19]  Daniel P. W. Ellis,et al.  Audio tagging with noisy labels and minimal supervision , 2019, DCASE.

[20]  Dong-Hyun Lee,et al.  Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks , 2013 .

[21]  Ting Gong,et al.  Label-efficient audio classification through multitask learning and self-supervision , 2019, ArXiv.

[22]  Vittorio Murino,et al.  Audio Surveillance , 2014, ACM Comput. Surv..

[23]  Haibo Mi,et al.  Mixup-Based Acoustic Scene Classification Using Multi-Channel Convolutional Neural Network , 2018, PCM.

[24]  Feng Liu,et al.  Learning Environmental Sounds with Multi-scale Convolutional Neural Network , 2018, 2018 International Joint Conference on Neural Networks (IJCNN).

[25]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[26]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[27]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Yong Xu,et al.  DCASE 2018 Challenge Surrey cross-task convolutional neural network baseline , 2018, DCASE.

[29]  Daniel P. W. Ellis,et al.  Learning Sound Event Classifiers from Web Audio with Noisy Labels , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Yong Xu,et al.  Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[32]  Le Song,et al.  Iterative Learning with Open-set Noisy Labels , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Judith C. Brown Calculation of a constant Q spectral transform , 1991 .

[34]  Shiguang Shan,et al.  Self-Paced Curriculum Learning , 2015, AAAI.

[35]  Qiang Huang,et al.  Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[36]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[37]  Osamu Akiyama,et al.  DCASE 2019 Task 2: Multitask Learning, Semi-supervised Learning and Model Ensemble with Noisy Data for Audio Tagging , 2019 .

[38]  Quoc V. Le,et al.  SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[39]  Richard Nock,et al.  Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Stephanie Seneff,et al.  Pitch and spectral analysis of speech based on an auditory synchrony model , 1985 .

[41]  Ivor W. Tsang,et al.  Masking: A New Perspective of Noisy Supervision , 2018, NeurIPS.

[42]  Mark D. Plumbley,et al.  Computational Analysis of Sound Scenes and Events , 2017 .

[43]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[44]  Mert R. Sabuncu,et al.  Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels , 2018, NeurIPS.

[45]  Aritra Ghosh,et al.  Robust Loss Functions under Label Noise for Deep Neural Networks , 2017, AAAI.

[46]  Tara N. Sainath,et al.  Deep Learning for Audio Signal Processing , 2019, IEEE Journal of Selected Topics in Signal Processing.

[47]  Xingrui Yu,et al.  Co-teaching: Robust training of deep neural networks with extremely noisy labels , 2018, NeurIPS.