Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection

We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate the long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this enables end-to-end training of audio event detection. Our architecture is inspired by the success of VGGNet and uses small 3x3 convolutions, but with more depth than previous AED methods. In order to prevent over-fitting and to take full advantage of the modeling capabilities of our network, we further propose a novel data augmentation method to introduce data variation. Experimental results show that our CNN significantly outperforms state-of-the-art methods, including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
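
The link between small 3x3 convolutions, depth, and a large input field can be illustrated with the standard receptive-field recurrence for stacked layers. The sketch below is only an illustration: the function name `receptive_field` and the layer stack (two 3x3 convolutions followed by 2x2 pooling, repeated three times) are assumptions chosen for the example, not the configuration used in the paper.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    input to the output. Dilation is assumed to be 1 throughout.
    """
    rf = 1    # one output unit initially sees one input sample
    jump = 1  # distance, in input samples, between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Two stacked 3x3 convolutions cover the same extent as one 5x5 filter,
# and three cover the same extent as one 7x7 filter.
two_convs = receptive_field([(3, 1), (3, 1)])
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])

# Hypothetical VGG-style stack (an assumption, not the paper's layout):
# [conv3x3, conv3x3, pool2x2 with stride 2] repeated three times.
block = [(3, 1), (3, 1), (2, 2)]
vgg_like = receptive_field(block * 3)

print(two_convs, three_convs, vgg_like)
```

Under these assumptions, each pooling stage doubles the spacing between units, so deeper blocks widen the input region seen by a single output unit much faster than the first block does. That is one way a deep stack of small filters can span an extended time period.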
