Data-efficient weakly supervised learning for low-resource audio event detection using deep learning

We propose a method for audio event detection under the common constraint that only limited training data are available. Two practical problems arise when training a deep learning system for this task. Firstly, most datasets are "weakly labelled": each recording comes with only a list of the events present, without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to perform well, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose a data-efficient training scheme for a stacked convolutional and recurrent neural network. The network is trained in a multi-instance learning setting, for which we introduce a new loss function that leads to improved training compared to the usual approaches for weakly supervised learning. We successfully test our approach on two low-resource datasets that lack temporal labels.
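
To make the multi-instance learning setting concrete, the following is a minimal sketch of one commonly used baseline for weak labels, not the new loss function proposed here, which this abstract does not specify. It assumes the network outputs a probability per frame and per class, pools the frames of a recording with a maximum (a recording is positive if at least one frame is), and compares the pooled value with the recording-level label using binary cross-entropy. The function names and the choice of max pooling are illustrative assumptions.

```python
import math


def clip_probability(frame_probs):
    """Max pooling over time: the recording-level probability for one class
    is the highest frame-level probability (assumed baseline, not the paper's loss)."""
    return max(frame_probs)


def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy between a predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def weak_label_loss(frame_probs_per_class, clip_labels):
    """Average the per-class loss for one recording.

    frame_probs_per_class: one list of frame probabilities per class.
    clip_labels: one 0/1 label per class (event present anywhere in the recording).
    """
    losses = [
        binary_cross_entropy(clip_probability(frame_probs), label)
        for frame_probs, label in zip(frame_probs_per_class, clip_labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical recording with three frames and two classes:
# class 0 is present (label 1), class 1 is absent (label 0).
example_loss = weak_label_loss([[0.1, 0.9, 0.2], [0.1, 0.2, 0.1]], [1, 0])
```

With max pooling, only the highest-scoring frame of each class contributes to the loss, which is one reason alternative pooling functions and losses are studied for weakly labelled data.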
