Multi-layer multi-instance kernel for video concept detection

Most existing methods for video concept detection do not exploit the intrinsic hierarchical structure of video content. Unlike the flat attribute-value data these methods assume, video is a structured medium with a multi-layer representation: a video can be described hierarchically, from coarse to fine, by shots, key-frames, and regions. This hierarchy also fits the typical Multi-Instance (MI) setting, in which a "bag-instance" correspondence holds between contiguous layers. In this paper, we refer to the multi-layer structure together with the "bag-instance" relations embedded in it as the Multi-Layer Multi-Instance (MLMI) setting. We formulate video concept detection as an MLMI learning problem in which each video segment is represented by a rooted tree that embeds the MLMI structure. Furthermore, by fusing information from different layers, we construct a novel MLMI kernel that measures the similarity between instances in the same layer and in different layers. In contrast to traditional MI learning, the proposed kernel leverages the multi-layer structure and the multi-instance relations simultaneously. We applied the MLMI kernel to the concept detection task on the TRECVID 2005 corpus and report superior performance (+25% in Mean Average Precision) compared with standard Support Vector Machine based approaches.
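
To make the rooted-tree representation concrete, the following minimal Python sketch shows one way a shot / key-frame / region hierarchy could be stored and compared. It is an illustration under stated assumptions, not the MLMI kernel defined in the paper: the names `Node`, `rbf`, and `tree_kernel`, the RBF base kernel, the `gamma` parameter, and the averaging over child pairs are all hypothetical choices made for this example. For simplicity it only compares nodes that sit in the same layer, whereas the kernel described above also relates instances across layers.

```python
from dataclasses import dataclass, field
from math import exp
from typing import List, Sequence


@dataclass
class Node:
    """One node of the rooted tree: a shot, a key-frame, or a region."""
    features: Sequence[float]
    children: List["Node"] = field(default_factory=list)


def rbf(x: Sequence[float], y: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian (RBF) base kernel between two feature vectors (assumed choice)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * d2)


def tree_kernel(a: Node, b: Node, gamma: float = 0.5) -> float:
    """Illustrative recursive kernel between two rooted trees.

    The similarity of two nodes is their base-kernel value, scaled by
    (1 + mean similarity over all pairs of their children). Each node
    acts as a "bag" whose children are its "instances".
    """
    k = rbf(a.features, b.features, gamma)
    if a.children and b.children:
        total = sum(
            tree_kernel(ca, cb, gamma)
            for ca in a.children
            for cb in b.children
        )
        k *= 1.0 + total / (len(a.children) * len(b.children))
    return k


# Hypothetical toy data: shot -> key-frames -> regions.
shot_a = Node([0.2, 0.1], [
    Node([0.5, 0.4], [Node([0.9, 0.1]), Node([0.3, 0.3])]),
    Node([0.6, 0.2], [Node([0.8, 0.2])]),
])
shot_b = Node([0.1, 0.1], [
    Node([0.4, 0.4], [Node([0.7, 0.1])]),
])

similarity = tree_kernel(shot_a, shot_b)
```

A Gram matrix built from such pairwise values could then be handed to any kernel classifier that accepts a precomputed kernel. The per-layer averaging used here is only one possible way to fuse information across layers.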
