Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the interested acoustic scene. In this paper, we make contributions to audio tagging in two parts, respectively, acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multilabel classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multilabel classification for the expected tags, considering that only chunk (or utterance) level rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter banks features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian mixture model baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system can get a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains the state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task while EER of the first prize of this challenge is 0.17.

[1]  Jonathan Foote,et al.  Automatic audio segmentation using a measure of audio novelty , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[2]  David A. Shamma,et al.  YFCC100M , 2015, Commun. ACM.

[3]  Bhiksha Raj,et al.  Audio Event Detection using Weakly Labeled Data , 2016, ACM Multimedia.

[4]  Gang Chen,et al.  Improve K-means clustering for audio data by exploring a reasonable sampling rate , 2010, 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery.

[5]  Jan Cernocký,et al.  Probabilistic and Bottle-Neck Features for LVCSR of Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[6]  Mark B. Sandler,et al.  Automatic Tagging Using Deep Convolutional Neural Networks , 2016, ISMIR.

[7]  Jun Du,et al.  An Experimental Study on Speech Enhancement Based on Deep Neural Networks , 2014, IEEE Signal Processing Letters.

[8]  Honglak Lee,et al.  An Analysis of Single-Layer Networks in Unsupervised Feature Learning , 2011, AISTATS.

[9]  Sandrine Brognaux,et al.  Analysis and automatic recognition of Human BeatBox sounds: A comparative study , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[11]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.

[12]  Aurélien Mayoue,et al.  Deep neural networks for audio scene recognition , 2015, 2015 23rd European Signal Processing Conference (EUSIPCO).

[13]  C.-C. Jay Kuo,et al.  Audio content analysis for online audiovisual data segmentation and classification , 2001, IEEE Trans. Speech Audio Process..

[14]  Pascal Vincent,et al.  Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives , 2012, ArXiv.

[15]  Arnaud Sahuguet,et al.  An audio indexing system for election video material , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[16]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Jun Du,et al.  Dynamic noise aware training for speech enhancement based on deep neural networks , 2014, INTERSPEECH.

[19]  Stefan Launer,et al.  Automatic Sound Classification Inspired by Auditory Scene Analysis , 2001 .

[20]  Jon Barker,et al.  Chime-home: A dataset for sound source recognition in a domestic environment , 2015, 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[21]  Daniel P. W. Ellis,et al.  Multiple-Instance Learning for Music Information Retrieval , 2008, ISMIR.

[22]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[23]  Mohan S. Kankanhalli,et al.  Unsupervised classification of music genre using hidden Markov model , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[24]  Roger B. Dannenberg,et al.  Segmentation, Clustering, and Display in a Personal Audio Database for Musicians , 2011, ISMIR.

[25]  Yixin Chen,et al.  MILES: Multiple-Instance Learning via Embedded Instance Selection , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[27]  Douglas Eck,et al.  Learning Features from Music Audio with Deep Belief Networks , 2010, ISMIR.

[28]  Toni Heittola,et al.  DOMESTIC AUDIO TAGGING WITH CONVOLUTIONAL NEURAL NETWORKS , 2016 .

[29]  Duy-Dinh Le,et al.  Multimedia Event Detection Using Event-Driven Multiple Instance Learning , 2015, ACM Multimedia.

[30]  Yongqiang Wang,et al.  An investigation of deep neural networks for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[31]  Jim Austin,et al.  Learning criteria for training neural network classifiers , 2005, Neural Computing & Applications.

[32]  Lie Lu,et al.  A flexible framework for key audio effects detection and auditory context inference , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[33]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[34]  Cordelia Schmid,et al.  Learning to Recognize Objects with Little Supervision , 2008, International Journal of Computer Vision.

[35]  Joydeep Ghosh,et al.  A Text Retrieval Approach to Content-Based Audio Hashing , 2008, International Society for Music Information Retrieval Conference.

[36]  Douglas Eck,et al.  Automatic Identification of Instrument Classes in Polyphonic and Poly-Instrument Audio , 2009, ISMIR.

[37]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[38]  Daniel P. W. Ellis,et al.  Spectral vs. spectro-temporal features for acoustic event detection , 2011, 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[39]  Thomas Lidy,et al.  CQT-based Convolutional Neural Networks for Audio Scene Classification , 2016, DCASE.

[40]  Dan Stowell,et al.  Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning , 2014, PeerJ.

[41]  Benjamin Schrauwen,et al.  Multiscale Approaches To Music Audio Feature Learning , 2013, ISMIR.

[42]  Sungrack Yun,et al.  Discriminative training of GMM parameters for audio scene classification and audio tagging , 2016 .

[43]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[44]  Ning Ma,et al.  The CHiME corpus: a resource and a challenge for computational hearing in multisource environments , 2010, INTERSPEECH.

[45]  Hermann Ney,et al.  Computing Mel-frequency cepstral coefficients on the power spectrum , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[46]  Lie Lu,et al.  Unsupervised content discovery in composite audio , 2005, MULTIMEDIA '05.

[47]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[48]  Benjamin Schrauwen,et al.  End-to-end learning for music audio , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49]  Heikki Huttunen,et al.  Polyphonic sound event detection using multi label deep neural networks , 2015, 2015 International Joint Conference on Neural Networks (IJCNN).

[50]  Tara N. Sainath,et al.  Unsupervised Audio Segmentation using Extended Baum-Welch Transformations , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[51]  Gert R. G. Lanckriet,et al.  Codebook-Based Audio Feature Representation for Music Information Retrieval , 2013, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[52]  R. Radhakrishnan,et al.  Audio analysis for surveillance applications , 2005, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005..

[53]  Michael E. Schuckers Receiver Operating Characteristic Curve and Equal Error Rate , 2010 .

[54]  Marimuthu Palaniswami,et al.  A pilot study of urban noise monitoring architecture using wireless sensor networks , 2013, 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI).

[55]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[56]  Tara N. Sainath,et al.  Improving deep neural networks for LVCSR using rectified linear units and dropout , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[57]  Daniel P. W. Ellis,et al.  Song-Level Features and Support Vector Machines for Music Classification , 2005, ISMIR.

[58]  Shiliang Zhang,et al.  Improving deep neural networks for LVCSR using dropout and shrinking structure , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).