Local Expert Forest of Score Fusion for Video Event Classification

We address the problem of complex event categorization in a large dataset of videos "in the wild", where multiple classifiers are applied independently and each assigns every video a 'likelihood' score. The core contribution of this paper is a local expert forest model for meta-level score fusion, aimed at event detection under heavily imbalanced class distributions. Our motivation is to adapt to variations in classifier performance across different regions of the score space, using a divide-and-conquer technique. We propose a novel method that partitions the likelihood space in a way that is sensitive to local label distributions in imbalanced data, and trains a pair of locally optimized experts for each partition. Multiple pairs of experts based on different partitions ('trees') form a 'forest', which balances local adaptivity against over-fitting. As a result, our model disregards classifiers in regions of the score space where they perform poorly, achieving both local source selection and fusion. We experiment on the TRECVID Multimedia Event Detection (MED) dataset, detecting 15 complex events in around 34k video clips totaling more than 1000 hours, and demonstrate superior performance compared to other score-level fusion methods.
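The divide-and-conquer idea can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify the partition rule, the expert model, or how trees are combined, so every choice below is an assumption made for illustration. The split is an axis-aligned threshold at the median positive score of a randomly chosen classifier, each expert is a class-weighted logistic regression, and the forest averages its trees. All function names (`train_expert`, `train_tree`, `train_forest`, `predict_forest`, `make_toy_scores`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_expert(X, y, epochs=200, lr=0.5):
    """Fit one fusion expert: class-weighted logistic regression (assumed model)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    # Each class contributes half of the total weight, whatever its size.
    w_pos = 0.5 / n_pos if n_pos else 0.0
    w_neg = 0.5 / n_neg if n_neg else 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = (p - yi) * (w_pos if yi == 1 else w_neg)
            for j in range(d):
                gw[j] += g * xi[j]
            gb += g
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict_expert(expert, x):
    w, b = expert
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_tree(X, y, rng):
    """One 'tree': split on one classifier's score, then train a pair of experts."""
    dim = rng.randrange(len(X[0]))
    pos_scores = sorted(x[dim] for x, yi in zip(X, y) if yi == 1)
    pool = pos_scores or sorted(x[dim] for x in X)
    # Assumed split rule: median positive score, so positives tend to
    # land on both sides even when they are rare.
    thr = pool[len(pool) // 2]
    low = [(x, yi) for x, yi in zip(X, y) if x[dim] < thr]
    high = [(x, yi) for x, yi in zip(X, y) if x[dim] >= thr]
    experts = []
    for part in (low, high):
        if part:
            experts.append(train_expert([p[0] for p in part], [p[1] for p in part]))
        else:
            # Empty region: fall back to an expert trained on all data.
            experts.append(train_expert(X, y))
    return dim, thr, experts[0], experts[1]

def train_forest(X, y, n_trees=5, seed=0):
    rng = random.Random(seed)
    return [train_tree(X, y, rng) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Average the local expert selected by each tree (assumed combination rule)."""
    scores = [predict_expert(lo if x[dim] < thr else hi, x)
              for dim, thr, lo, hi in forest]
    return sum(scores) / len(scores)

def make_toy_scores(seed=1, n_neg=200, n_pos=20):
    """Synthetic, imbalanced two-classifier scores; not TRECVID data."""
    rng = random.Random(seed)
    X = [[rng.uniform(0.0, 0.6), rng.uniform(0.0, 0.6)] for _ in range(n_neg)]
    X += [[rng.uniform(0.4, 1.0), rng.uniform(0.4, 1.0)] for _ in range(n_pos)]
    y = [0] * n_neg + [1] * n_pos
    return X, y
```

With `X, y = make_toy_scores()` and `forest = train_forest(X, y)`, calling `predict_forest(forest, [0.95, 0.95])` should yield a higher fused score than `predict_forest(forest, [0.05, 0.05])`. Because each expert sees only the samples in its own region, a classifier whose scores are uninformative there can receive a small weight locally while still contributing elsewhere.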
