Subband autocorrelation features for video soundtrack classification

Inspired by the system presented in [1], we have developed novel auditory-model-based features that preserve the fine time structure lost in conventional frame-based features. The original auditory model is computationally intensive, so we present a simpler system that runs about ten times faster while achieving equivalent performance. We evaluate these features on video soundtrack classification with the Columbia Consumer Video dataset. The new features alone are roughly comparable to traditional MFCCs, but combining classifiers based on both feature types yields a substantial mean Average Precision improvement of 15% over the MFCC baseline.
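To make the evaluation metric concrete, the following is a minimal sketch of how mean Average Precision can be computed and how scores from two classifiers might be combined. It is an illustration under stated assumptions, not the procedure used in this work: the abstract does not say how the classifiers are combined, so the weighted score averaging in `fuse_scores` is a hypothetical choice, and all function names and the toy scores are invented for the example.

```python
from typing import Dict, List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean of precision@k taken at the rank k of each positive."""
    # Rank clips from highest to lowest classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    # Return 0.0 when the class has no positive clips.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    scores_by_class: Dict[str, Sequence[float]],
    labels_by_class: Dict[str, Sequence[int]],
) -> float:
    """Average the per-class AP values over all classes."""
    aps = [
        average_precision(scores_by_class[name], labels_by_class[name])
        for name in labels_by_class
    ]
    return sum(aps) / len(aps)


def fuse_scores(a: Sequence[float], b: Sequence[float], weight: float = 0.5) -> List[float]:
    """Hypothetical late fusion: weighted average of two classifiers' scores."""
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Toy data for one class: 1 marks a clip that contains the concept.
    labels = [1, 0, 1, 0]
    mfcc_scores = [0.9, 0.8, 0.3, 0.2]      # invented MFCC-based scores
    new_feat_scores = [0.2, 0.1, 0.9, 0.8]  # invented scores from the new features
    fused = fuse_scores(mfcc_scores, new_feat_scores)
    print("AP, MFCC only:   ", average_precision(mfcc_scores, labels))
    print("AP, new features:", average_precision(new_feat_scores, labels))
    print("AP, fused:       ", average_precision(fused, labels))
```

In this toy case each classifier ranks one positive clip well and the other poorly, so the fused ranking places both positives first. That is the kind of complementarity that can make a combination outperform either feature set alone.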

[1]  Baoxin Li, et al. YouTubeCat: Learning to categorize wild web videos, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[2]  Shih-Fu Chang, et al. Consumer video understanding: a benchmark database and an evaluation of human and machine performance, 2011, ICMR.

[3]  Daniel P. W. Ellis, et al. Decoding speech in the presence of other sources, 2005, Speech Commun.

[4]  Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[5]  Samy Bengio, et al. Sound Retrieval and Ranking Using Sparse Auditory Representations, 2010, Neural Computation.

[6]  Daniel P. W. Ellis, et al. Soundtrack classification by transient events, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Lie Lu, et al. Multimedia Systems, 2003, DOI 10.1007/s00530-002-0065-0.

[8]  Rainer Stiefelhagen, et al. Content-based video genre classification using multiple cues, 2010, AIEMPro '10.

[9]  Richard F. Lyon, et al. Sparse coding of auditory features for machine hearing in interference, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Samy Bengio, et al. A Discriminative Approach for the Retrieval of Images from Text Queries, 2006, ECML.

[11]  Daniel P. W. Ellis, et al. Noise Robust Pitch Tracking by Subband Autocorrelation Classification, 2012, INTERSPEECH.

[12]  Daniel P. W. Ellis, et al. Audio-Based Semantic Concept Classification for Consumer Video, 2010, IEEE Transactions on Audio, Speech, and Language Processing.