Late fusion and calibration for multimedia event detection using few examples

The state-of-the-art in example-based multimedia event detection (MED) rests on heterogeneous classifiers whose scores are typically combined in a late-fusion scheme. Recent studies on this topic have failed to reach a clear consensus as to whether machine learning techniques can outperform rule-based fusion schemes with varying amount of training data. In this paper, we present two parametric approaches to late fusion: a normalization scheme for arithmetic mean fusion (logistic averaging) and a fusion scheme based on logistic regression, and compare them to widely used rule-based fusion schemes. We also describe how logistic regression can be used to calibrate the fused detection scores to predict an optimal threshold given a detection prior and costs on errors. We discuss the advantages and shortcomings of each approach when the amount of positives available for training varies from 10 positives (10Ex) to 100 positives (100Ex). Experiments were run using video data from the NIST TRECVID MED 2013 evaluation and results were reported in terms of a ranking metric: the mean average precision (mAP) and R0, a cost-based metric introduced in TRECVID MED 2013.

[1]  Florian Metze,et al.  Informedia @ TRECVID 2011 , 2011 .

[2]  Chih-Jen Lin,et al.  Trust Region Newton Method for Logistic Regression , 2008, J. Mach. Learn. Res..

[3]  Wen Wang,et al.  Rich system combination for keyword spotting in noisy and acoustically heterogeneous audio streams , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Niko Brümmer,et al.  Application-independent evaluation of speaker detection , 2006, Comput. Speech Lang..

[5]  Murat Akbacak,et al.  Extracting spoken and acoustic concepts for multimedia event detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[6]  GeversTheo,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010 .

[7]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[8]  Koen E. A. van de Sande,et al.  Recommendations for video event recognition using concept vocabularies , 2013, ICMR.

[9]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Daniel P. W. Ellis,et al.  IBM Research and Columbia University TRECVID-2012 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), and Semantic Indexing (SIN) Systems , 2012, TRECVID.

[11]  Chin-Hui Lee,et al.  Explicit Performance Metric Optimization for Fusion-Based Video Retrieval , 2012, ECCV Workshops.

[12]  Arun Ross,et al.  Score normalization in multimodal biometric systems , 2005, Pattern Recognit..

[13]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[14]  Lukás Burget,et al.  A unified approach for audio characterization and its application to speaker recognition , 2012, Odyssey.

[15]  Wei Liu,et al.  Double Fusion for Multimedia Event Detection , 2012, MMM.

[16]  Ramakant Nevatia,et al.  ACTIVE: Activity Concept Transitions in Video Event Classification , 2013, 2013 IEEE International Conference on Computer Vision.

[17]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[18]  Alexander G. Hauptmann,et al.  MoSIFT: Recognizing Human Actions in Surveillance Videos , 2009 .

[19]  Kan Chen,et al.  The 2013 SESAME Multimedia Event Detection and Recounting System , 2013, TRECVID.

[20]  Robert C. Bolles,et al.  Rectification and recognition of text in 3-D scenes , 2004, International Journal of Document Analysis and Recognition (IJDAR).

[21]  Cordelia Schmid,et al.  AXES at TRECVID 2012: KIS, INS, and MED , 2012, TRECVID.

[22]  Ramakant Nevatia,et al.  Evaluating multimedia features and fusion for example-based event detection , 2013, Machine Vision and Applications.

[23]  Ramakant Nevatia,et al.  Large-scale web video event classification by use of Fisher Vectors , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[24]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[25]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.