Deep Neural Network Baseline for DCASE Challenge 2016

The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer visions, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. In Task 1 we obtained accuracy of 81.0% using Mel + DNN against 77.2% by using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained F value of 12.6% using Mel + DNN against 37.0% by using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained F value of 36.3% using Mel + DNN against 23.7% by using MFCCs + GMM. In Task 4 we obtained Equal Error Rate (ERR) of 18.9% using Mel + DNN against 20.9% by using MFCCs + GMM. Therefore the DNN improves the baseline in Task 1, 3, and 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.

[1]  Guillaume Lemaitre,et al.  Real-Time Detection of Overlapping Sound Events with Non-Negative Matrix Factorization , 2013 .

[2]  Judith C. Brown,et al.  An efficient algorithm for the calculation of a constant Q transform , 1992 .

[3]  Chaitali Chakrabarti,et al.  Lifelogging: Archival and retrieval of continuously recorded audio using wearable devices , 2012, 2012 IEEE International Conference on Emerging Signal Processing Applications.

[4]  Mark B. Sandler,et al.  Automatic Tagging Using Deep Convolutional Neural Networks , 2016, ISMIR.

[5]  Tuomas Virtanen,et al.  Context-dependent sound event detection , 2013, EURASIP Journal on Audio, Speech, and Music Processing.

[6]  Buket D. Barkana,et al.  NON-SPEECH ENVIRONMENTAL SOUND CLASSIFICATION USING SVMS WITH A NEW SET OF FEATURES , 2012 .

[7]  Heikki Huttunen,et al.  Recurrent neural networks for polyphonic sound event detection in real life recordings , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Birger Kollmeier,et al.  On the use of spectro-temporal features for the IEEE AASP challenge ‘detection and classification of acoustic scenes and events’ , 2013, 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[9]  Jon Barker,et al.  Chime-home: A dataset for sound source recognition in a domestic environment , 2015, 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[10]  Honglak Lee,et al.  Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[11]  Li Deng,et al.  A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Andreas Spanias,et al.  Segmentation, Indexing, and Retrieval for Environmental and Natural Sounds , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[14]  VirtanenTuomas,et al.  Detection and Classification of Acoustic Scenes and Events , 2018 .

[15]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[16]  Lie Lu,et al.  A flexible framework for key audio effects detection and auditory context inference , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Tuomas Virtanen,et al.  TUT database for acoustic scene classification and sound event detection , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[18]  Yan Song,et al.  Robust Sound Event Classification Using Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[19]  Martial Michel,et al.  The CLEAR 2007 Evaluation , 2007, CLEAR.