DCASE 2018 Challenge Surrey cross-task convolutional neural network baseline

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system. We implemented CNNs with 4 layers and 8 layers, originating from AlexNet and VGG in computer vision. We investigated how performance varies from task to task when the same neural network configuration is used. Experiments show that the deeper 8-layer CNN performs better than the 4-layer CNN on all tasks except Task 1. Using the 8-layer CNN, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
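
The results above are reported with several different metrics (accuracy, AUC, F1 score). As a rough illustration of how three of them are defined for a binary problem, the following is a minimal pure-Python sketch. It is not taken from the released baseline code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for this example, and the challenge's own scoring (for instance the sound event detection F1 score of Task 4, or MAP on Task 2) may be computed differently.

```python
# Illustrative metric definitions; not the released baseline implementation.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a
    random positive example is scored above a random negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: reference labels and classifier output scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
# Assumed decision threshold of 0.5 to turn scores into hard labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", auc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores, which is one reason a task may report both an accuracy and an AUC.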
