Explicit Performance Metric Optimization for Fusion-Based Video Retrieval

We present a learning framework for fusion-based video retrieval system, which explicitly optimizes given performance metrics. Real-world computer vision systems serve sophisticated user needs, and domain-specific performance metrics are used to monitor the success of such systems. However, the conventional approach for learning under such circumstances is to blindly minimize standard error rates and hope the targeted performance metrics improve, which is clearly suboptimal. In this work, a novel scheme to directly optimize such targeted performance metrics during learning is developed and presented. Our experimental results on two large consumer video archives are promising and showcase the benefits of the proposed approach.

[1]  Sebastian Nowozin,et al.  On feature combination for multiclass object classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[2]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[3]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[4]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[5]  S. Haykin,et al.  Pattern Recognition Using a Family of Design Algorithms Based upon the Generalized Probabilistic Descent Method , 2001 .

[6]  Baoxin Li,et al.  YouTubeCat: Learning to categorize wild web videos , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  Frank K. Soong,et al.  A segment model based approach to speech recognition , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[8]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[9]  Chin-Hui Lee,et al.  A MFoM learning approach to robust multiclass multi-label text categorization , 2004, ICML.

[10]  Manik Varma,et al.  Learning The Discriminative Power-Invariance Trade-Off , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[11]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[12]  Alvin F. Martin,et al.  The DET curve in assessment of detection task performance , 1997, EUROSPEECH.

[13]  Jiebo Luo,et al.  Recognizing realistic actions from videos , 2009, CVPR.

[14]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Luciano Sbaiz,et al.  Finding meaning on YouTube: Tag recommendation and category discovery , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Vincent Lepetit,et al.  Pareto-optimal dictionaries for signatures , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[17]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  George Toderici,et al.  Discriminative tag learning on YouTube videos with latent sub-tags , 2011, CVPR 2011.

[19]  Thorsten Joachims,et al.  A support vector method for multivariate performance measures , 2005, ICML.