Fully Deep Neural Networks Incorporating Unsupervised Feature Learning for Audio Tagging

This paper contributes to audio tagging in two areas: acoustic modeling and feature learning. We propose a fully deep neural network (DNN) framework incorporating unsupervised feature learning that handles the multi-label classification task as a regression problem. Because only chunk-level rather than frame-level labels are available, all or nearly all frames of a chunk are fed into the DNN, which performs multi-label regression onto the expected tags. The fully DNN, regarded as an encoding function, maps the audio feature sequence to a multi-tag vector. For unsupervised feature learning, we propose a deep auto-encoder (AE) that generates new features with a non-negative representation from the basic features; these new features further improve audio tagging performance. A deep pyramid structure is also designed to extract more robust high-level features related to the target tags. Further techniques, such as dropout and background-noise-aware training, are adopted to enhance the generalization of the DNNs to new audio recordings in mismatched environments. Compared with the conventional Gaussian mixture model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method can exploit long-term temporal information by taking the whole chunk as input. The results show that our approach obtains a 19.1% relative improvement over the official GMM-based baseline of the DCASE 2016 audio tagging task.
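The chunk-level mapping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, ReLU hidden units, sigmoid outputs, random untrained weights, the 0.5 decision threshold, and all function names (`tag_chunk`, `init_layer`, and so on) are assumptions chosen only for illustration. The sketch shows the one idea stated in the abstract, namely that all frames of a chunk are concatenated into a single input vector and mapped through successively narrower layers to one score per tag.

```python
import math
import random


def relu(values):
    """Rectified linear activation (assumed; the abstract names no activation)."""
    return [max(0.0, v) for v in values]


def sigmoid(values):
    """Squash each output into (0, 1) so it can be read as a tag score."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def dense(weights, bias, x):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out, scale=0.1):
    """Random, untrained parameters; a real system would learn these."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def tag_chunk(frames, layers):
    """Map a whole chunk (list of frame feature vectors) to a multi-tag vector."""
    # Concatenate every frame so the network sees the entire chunk at once.
    x = [value for frame in frames for value in frame]
    for weights, bias in layers[:-1]:
        x = relu(dense(weights, bias, x))
    weights, bias = layers[-1]
    return sigmoid(dense(weights, bias, x))


# Hypothetical toy dimensions: 5 frames of 4 features each, 3 tags.
N_FRAMES, N_FEATS, N_TAGS = 5, 4, 3
rng = random.Random(0)

# Shrinking widths (20 -> 16 -> 8 -> 3) loosely mimic a pyramid-shaped network.
layers = [
    init_layer(rng, N_FRAMES * N_FEATS, 16),
    init_layer(rng, 16, 8),
    init_layer(rng, 8, N_TAGS),
]

frames = [[rng.gauss(0.0, 1.0) for _ in range(N_FEATS)] for _ in range(N_FRAMES)]
scores = tag_chunk(frames, layers)
predicted_tags = [score >= 0.5 for score in scores]  # assumed threshold
```

Because the weights are untrained, the scores carry no meaning here; the point is only the input and output shapes. In a trained system, the regression targets would be the chunk-level binary tag vectors.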
