Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection

We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate the long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this enables to train audio event detection end-to-end. Our architecture is inspired by the success of VGGNet and uses small, 3x3 convolutions, but more depth than previous methods in AED. In order to prevent over-fitting and to take full advantage of the modeling capabilities of our network, we further propose a novel data augmentation method to introduce data variation. Experimental results show that our CNN significantly outperforms state of the art methods including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.

[1]  Andrey Temko,et al.  Comparison of Sequence Discriminant Support Vector Machines for Acoustic Event Classification , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[2]  David Heckerman,et al.  A Tractable Inference Algorithm for Diagnosing Multiple Diseases , 2013, UAI.

[3]  Thomas S. Huang,et al.  Real-world acoustic event detection , 2010, Pattern Recognit. Lett..

[4]  Hanseok Ko,et al.  Acoustic event recognition using dominant spectral basis vectors , 2015, INTERSPEECH.

[5]  Augusto Sarti,et al.  Scream and gunshot detection and localization for audio-surveillance systems , 2007, 2007 IEEE Conference on Advanced Video and Signal Based Surveillance.

[6]  Florian Metze,et al.  Improved audio features for large-scale multimedia event detection , 2014, 2014 IEEE International Conference on Multimedia and Expo (ICME).

[7]  Masakiyo Fujimoto,et al.  Exploiting spectro-temporal locality in deep learning based acoustic event detection , 2015, EURASIP J. Audio Speech Music. Process..

[8]  C.-C. Jay Kuo,et al.  Audio content analysis for online audiovisual data segmentation and classification , 2001, IEEE Trans. Speech Audio Process..

[9]  Myung Jong Kim,et al.  Robust sound event classification using LBP-HOG based bag-of-audio-words feature representation , 2015, INTERSPEECH.

[10]  Navdeep Jaitly,et al.  Vocal Tract Length Perturbation (VTLP) improves speech recognition , 2013 .

[11]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[12]  Jiajun Wu,et al.  Deep multiple instance learning for image classification and auto-annotation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Satoshi Nakamura,et al.  Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition , 2000, LREC.

[15]  Hanseok Ko,et al.  Selective Background Adaptation Based Abnormal Acoustic Event Recognition for Audio Surveillance , 2012, 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance.

[16]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[17]  Zvi Kons,et al.  Audio event classification using deep neural networks , 2013, INTERSPEECH.

[18]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[19]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[20]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Colin Raffel,et al.  Lasagne: First release. , 2015 .

[22]  Huy Phan,et al.  Representing nonspeech audio signals through speech classification models , 2015, INTERSPEECH.

[23]  Jesús Favela,et al.  Scalable identification of mixed environmental sounds, recorded from heterogeneous sources , 2015, Pattern Recognit. Lett..

[24]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[25]  Heikki Huttunen,et al.  Recognition of acoustic events using deep neural networks , 2014, 2014 22nd European Signal Processing Conference (EUSIPCO).

[26]  Masakiyo Fujimoto,et al.  Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  Yu Tsao,et al.  Sparse representation with temporal max-smoothing for acoustic event detection , 2015, INTERSPEECH.

[28]  Paul A. Viola,et al.  Multiple Instance Boosting for Object Detection , 2005, NIPS.

[29]  Brian Kingsbury,et al.  Very deep multilingual convolutional neural networks for LVCSR , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Antoine Liutkus,et al.  Informed source separation: Source coding meets source separation , 2011, 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[31]  Xavier Serra,et al.  Freesound technical demo , 2013, ACM Multimedia.

[32]  Jiqing Han,et al.  Robust minimum statistics project coefficients feature for acoustic environment recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Vesa T. Peltonen,et al.  Audio-based context recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[34]  Daniel P. W. Ellis,et al.  Audio-Based Semantic Concept Classification for Consumer Video , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[35]  Chin-Hui Lee,et al.  A blind segmentation approach to acoustic event detection based on i-vector , 2013, INTERSPEECH.

[36]  Bernard Mérialdo,et al.  Video Summarization Based on Balanced AV-MMR , 2012, MMM.

[37]  Murat Akbacak,et al.  Bag-of-Audio-Words Approach for Multimedia Event Classification , 2012, INTERSPEECH.

[38]  R. Radhakrishnan,et al.  Audio analysis for surveillance applications , 2005, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005..

[39]  Shrikanth Narayanan,et al.  Environmental Sound Recognition With Time–Frequency Audio Features , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[40]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.