Visual Cue Cluster Construction via Information Bottleneck Principle and Kernel Density Estimation

Recent research in video analysis has shown a promising direction in which mid-level features (e.g., people, anchor, indoor) are abstracted from low-level features (e.g., color, texture, motion) and used for discriminative classification of semantic labels. However, in most systems such mid-level features are selected manually. In this paper, we propose an information-theoretic framework, visual cue cluster construction (VC3), to automatically discover adequate mid-level features. The problem is posed as mutual information maximization, through which optimal cue clusters are discovered that preserve the most information about the semantic labels. We extend the Information Bottleneck framework to high-dimensional continuous features and further propose a projection method that maps each video into probabilistic memberships over all the cue clusters. The main advantage of the proposed approach is that it removes both the dependence on manually choosing the mid-level features and the large labor cost of annotating a training corpus for the detector of each mid-level feature. The proposed VC3 framework is general and effective, and could potentially be applied to other problems in semantic video analysis. When tested on news video story segmentation, the proposed approach achieves a promising performance gain over representations derived from conventional clustering techniques and even over manually selected mid-level features.
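
The two quantities named in the abstract, the mutual information between cue clusters and semantic labels and the probabilistic membership of a sample over the clusters, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the isotropic Gaussian kernel, and the single shared bandwidth are all assumptions made for illustration, and the cluster assignments are taken as given rather than found by an Information Bottleneck procedure.

```python
import numpy as np


def mutual_information(joint):
    """Return I(C;Y) in bits from a joint table p(c, y).

    Rows index cue clusters c, columns index semantic labels y.
    (Hypothetical helper; the table is normalized defensively.)
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    pc = joint.sum(axis=1, keepdims=True)   # marginal p(c), shape (C, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, Y)
    mask = joint > 0                        # skip 0 * log(0) terms
    ratio = joint[mask] / (pc @ py)[mask]
    return float(np.sum(joint[mask] * np.log2(ratio)))


def kde_memberships(x, features, assignments, n_clusters, bandwidth=1.0):
    """Return p(c | x) for one feature vector x (hypothetical helper).

    Assumes an isotropic Gaussian kernel with one shared bandwidth, so the
    summed kernel over the members of cluster c is proportional to
    p(c) * p(x | c), and normalizing over c gives the posterior.
    """
    x = np.asarray(x, dtype=float)
    features = np.asarray(features, dtype=float)
    assignments = np.asarray(assignments)
    sq_dist = np.sum((features - x) ** 2, axis=1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    scores = np.array(
        [kernel[assignments == c].sum() for c in range(n_clusters)]
    )
    return scores / scores.sum()


# Toy data: two well-separated clusters in a 2-D feature space.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
assignments = [0, 0, 1, 1]
memberships = kde_memberships([0.0, 0.1], features, assignments, n_clusters=2)

# Clusters that determine the label exactly carry 1 bit about a balanced
# binary label; clusters independent of the label carry 0 bits.
informative = mutual_information([[0.5, 0.0], [0.0, 0.5]])
uninformative = mutual_information([[0.25, 0.25], [0.25, 0.25]])
```

In this toy setting the membership vector puts nearly all of its mass on the first cluster. If a sample lies very far from every training point, the kernel values can underflow to zero, so a practical version would work with log-densities instead.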
