Improved audio features for large-scale multimedia event detection

In this paper, we present recent experiments on using Artificial Neural Networks (ANNs), a new “delayed” approach to speech vs. non-speech segmentation, and the extraction of large-scale pooling features (LSPF) for detecting “events” within consumer videos, using the audio channel only. An “event” is defined as a sequence of observations in a video that can be directly observed or inferred. Ground truth is given by a semantic description of the event and by a number of example videos. We describe and compare several algorithmic approaches, and report results on the 2013 TRECVID Multimedia Event Detection (MED) task, using arguably the largest such research set currently available. The presented system achieved the best results in most audio-only conditions. While the overall finding is that MFCC features perform best, we find that ANN features as well as LSPF provide complementary information at various levels of temporal resolution. This paper provides an analysis of both low-level and high-level features, investigating their relative contributions to overall system performance.
