A scalable feature learning and tag prediction framework for natural environment sounds

Building feature extraction approaches that can effectively characterize natural environment sounds is challenging due to their dynamic nature. In this paper, we develop a framework for extracting features from such data and obtaining semantic inferences from them. In particular, we propose a new pooling strategy for deep architectures that preserves the temporal dynamics in the resulting representation. We then construct an ensemble of semantic embeddings and employ an ℓ1-reconstruction-based prediction algorithm to estimate the relevant tags. We evaluate our approach on challenging environmental sound recognition datasets and show that the proposed features outperform traditional spectral features.
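The abstract does not spell out the pooling rule or the prediction algorithm, so the following is only a minimal sketch of one plausible reading, not the method of the paper. It assumes that temporal structure is kept by max-pooling frame features over ordered time segments at several resolutions, and that tags are scored by reconstructing a test embedding from training embeddings under an ℓ1 penalty (solved here with plain ISTA). The function names `temporal_pyramid_pool`, `l1_reconstruct`, and `predict_tags`, the `levels` and `lam` parameters, and the tag-scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def temporal_pyramid_pool(frames, levels=(1, 2, 4)):
    """Max-pool frame features over ordered time segments (assumed scheme).

    frames: array of shape (n_frames, n_features), n_frames >= max(levels).
    Returns a vector of length n_features * sum(levels).
    """
    frames = np.asarray(frames, dtype=float)
    pooled = []
    for n_segments in levels:
        # array_split keeps segments in time order, so the concatenated
        # vector retains coarse temporal structure, unlike global pooling.
        for segment in np.array_split(frames, n_segments, axis=0):
            pooled.append(segment.max(axis=0))
    return np.concatenate(pooled)


def l1_reconstruct(D, x, lam=0.1, n_iter=200):
    """Approximately solve min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA.

    D: (dim, n_train) matrix whose columns are training embeddings.
    x: (dim,) test embedding.
    """
    # Step size 1/L, where L is the squared spectral norm of D.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))  # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return a


def predict_tags(D, tags, x, lam=0.1):
    """Score tags by the magnitude of the sparse coefficients (assumed rule).

    tags: (n_train, n_tags) binary matrix of training tags.
    """
    a = l1_reconstruct(D, x, lam)
    return np.abs(a) @ tags
```

In this sketch, a clip would first be reduced to a fixed-length vector with `temporal_pyramid_pool`, and the resulting vectors (or embeddings derived from them) would form the columns of `D`. An ensemble, as mentioned in the abstract, could be approximated by averaging the scores from several such embeddings, though how the paper combines them is not stated here.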
