Audio concept classification with Hierarchical Deep Neural Networks

Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian Mixture Models (GMMs) have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Although deep learning has shown promise in applications such as speech and object recognition, it has not yet met expectations in other fields such as audio concept classification. This paper explores, for the first time, the potential of deep learning for classifying audio concepts in User-Generated Content videos. The proposed system comprises two cascaded neural networks in a hierarchical configuration that analyze short- and long-term context information. Our system outperforms a GMM approach by a relative 54%, a Neural Network by 33%, and a Deep Neural Network by 12% on the TRECVID-MED database.
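The cascade of two networks over short- and long-term context can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify what the second network consumes, so the sketch assumes it reads a wider window of the first network's per-frame posteriors. All layer sizes, window radii, and the random stand-in features and weights are hypothetical, and no training is performed.

```python
import numpy as np

def stack_context(frames, radius):
    """Concatenate each frame with `radius` neighbours on each side (edges repeated)."""
    padded = np.pad(frames, ((radius, radius), (0, 0)), mode="edge")
    n = frames.shape[0]
    return np.concatenate([padded[i:i + n] for i in range(2 * radius + 1)], axis=1)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mlp(x, weights):
    """Feed-forward pass: tanh hidden layers followed by a softmax output layer."""
    for w, b in weights[:-1]:
        x = np.tanh(x @ w + b)
    w, b = weights[-1]
    return softmax(x @ w + b)

def init(sizes, rng):
    """Random (untrained) weights; a real system would learn these from labeled audio."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_frames, n_feats, n_concepts = 200, 13, 10
short_radius, long_radius = 4, 10

# Stand-in for per-frame acoustic features of one audio track.
features = rng.normal(size=(n_frames, n_feats))

# First network: short-term context over raw features.
net1 = init([n_feats * (2 * short_radius + 1), 64, 64, n_concepts], rng)
post1 = mlp(stack_context(features, short_radius), net1)

# Second network (assumed design): long-term context over first-stage posteriors.
net2 = init([n_concepts * (2 * long_radius + 1), 64, n_concepts], rng)
post2 = mlp(stack_context(post1, long_radius), net2)

# Per-frame concept decision from the second stage.
labels = post2.argmax(axis=1)
```

In this arrangement the second stage sees a much longer stretch of audio than the first, but through a compact representation (one posterior vector per frame), which keeps its input dimension manageable.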
