Multi-Layer Multi-Instance Learning for Video Concept Detection

This paper presents a novel learning-based method, called "multi-layer multi-instance (MLMI) learning," for video concept detection. Most existing methods treat video as a flat data sequence and do not deeply investigate the intrinsic hierarchical structure of video content. However, video is essentially a medium with a multi-layer (ML) structure. For example, a video can be represented by a hierarchy consisting, from large to small, of shots, frames, and regions, where each pair of contiguous layers fits the typical multi-instance (MI) setting. We call such an ML structure, together with the MI relations embedded in it, the MLMI setting. In this paper, we systematically study both the ML structure and the MI relations embedded in video content by formulating video concept detection as an MLMI learning problem. Specifically, we first construct an MLMI kernel to simultaneously model the ML structure and the MI relations. To deal with the ambiguity propagation problem introduced by weak labeling and the ML structure, we then propose a regularization framework that takes the hyper-bag prediction error, sublayer prediction error, inter-layer inconsistency measure, and classifier complexity into consideration. We have applied the proposed MLMI learning method to the concept detection task on the TRECVid 2005 development corpus and report better performance than vector-based and state-of-the-art MI learning methods.
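
To make the shot, frame, and region hierarchy concrete, the following is a minimal sketch of a kernel between two nested structures. It is not the MLMI kernel of the paper, whose exact form the abstract does not give. It assumes a simple normalized set kernel (the average of all pairwise child similarities) applied recursively from regions up to shots, with a Gaussian kernel on region features. The names `rbf`, `layered_set_kernel`, and the `gamma` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf(x, y, gamma=1.0):
    """Gaussian kernel between two region feature vectors (assumed base kernel)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))


def layered_set_kernel(a, b, gamma=1.0):
    """Recursive normalized set kernel over nested Python lists.

    A leaf is a 1-D feature vector (a region). An inner node is a list of
    children: a frame is a list of regions, a shot is a list of frames.
    This is an illustrative stand-in, not the kernel defined in the paper.
    """
    a_is_leaf = not isinstance(a, list)
    b_is_leaf = not isinstance(b, list)
    if a_is_leaf and b_is_leaf:
        return rbf(a, b, gamma)
    if a_is_leaf != b_is_leaf:
        raise ValueError("both structures must have the same depth")
    if not a or not b:
        raise ValueError("every bag must contain at least one instance")
    # Average the kernel over all pairs of children, one layer down.
    total = sum(layered_set_kernel(ca, cb, gamma) for ca in a for cb in b)
    return total / (len(a) * len(b))


# Two toy shots: each shot is a list of frames, each frame a list of regions.
shot_a = [
    [np.array([0.0, 0.0]), np.array([1.0, 0.0])],
    [np.array([0.0, 1.0])],
]
shot_b = [
    [np.array([0.1, 0.0])],
    [np.array([1.0, 1.0]), np.array([0.0, 0.9])],
]

k_ab = layered_set_kernel(shot_a, shot_b)
k_ba = layered_set_kernel(shot_b, shot_a)
k_aa = layered_set_kernel(shot_a, shot_a)
```

Averaging over child pairs keeps the value in (0, 1] regardless of how many frames or regions a shot contains, and the resulting Gram matrix could be passed to any kernel classifier that accepts precomputed kernels. The regularization terms described in the abstract (sublayer prediction error and inter-layer inconsistency) are not modeled in this sketch.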
