Sound Event Detection in the DCASE 2017 Challenge

Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) has contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each with its own dataset and detection requirements: Task 2, in which target sound events were very rare in both training and testing data; Task 3, which had overlapping events annotated in real-life audio; and Task 4, in which only weakly labeled data were available for training. In this paper, we present these three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks and the continued wide use of mel frequency-based representations, with only a few approaches standing out as radically different. Analysis of the systems' behavior reveals that task-specific optimization plays a large role in producing good performance; however, this optimization often closely follows the ranking metric, and maximizing or minimizing that metric does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant differences in performance between the top systems and the baseline for all tasks.
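The jackknife-based confidence intervals mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's exact procedure: the per-file (TP, FP, FN) counts, the pooled F-score statistic, the function names, and the normal approximation for the interval are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def f_score(counts):
    """Pooled F-score from a list of per-file (tp, fp, fn) tuples (assumed layout)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def jackknife_ci(items, statistic, confidence=0.95):
    """Return (estimate, lower, upper) using leave-one-out jackknife resampling.

    The interval uses a normal approximation, which is an assumption of this
    sketch; a Student's t quantile is another common choice.
    """
    n = len(items)
    if n < 2:
        raise ValueError("jackknife needs at least two items")
    estimate = statistic(items)
    # Recompute the statistic n times, each time leaving one item out.
    leave_one_out = [statistic(items[:i] + items[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    # Jackknife standard error of the statistic.
    std_err = math.sqrt((n - 1) / n * sum((v - mean_loo) ** 2 for v in leave_one_out))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Hypothetical per-file counts for one system.
counts = [(8, 2, 1), (5, 1, 3), (9, 0, 2), (4, 3, 3), (7, 1, 1)]
estimate, lower, upper = jackknife_ci(counts, f_score)
print(f"F-score {estimate:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Because the pooled F-score is not a simple average of per-file values, its variance has no simple closed form; leaving out one file at a time gives an estimate of that variance directly. Two systems whose intervals do not overlap can then be read as significantly different under this approximation.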
