Discovering Video Shot Categories by Unsupervised Stochastic Graph Partition

Video shots are often treated as the basic units for retrieving information from videos. Video shot categorization has received increasing attention in recent years, but most methods rely on supervised learning, i.e., training a multi-class predictor (classifier) on labeled data. In this paper, we study a general framework for discovering video shot categories without supervision. The contributions are three-fold, covering feature, representation, and inference: (1) A new feature is proposed to capture local information in videos, defined on small video patches (e.g., 11 × 11 × 5 pixels). A dictionary of video words can thus be clustered off-line, characterizing both appearance and motion dynamics. (2) We pose categorization as an automated graph partition task in which each graph vertex represents a video shot and each partitioned sub-graph of connected vertices represents a clustered category. The model of each video shot category can be calculated analytically by a projection-pursuit type of learning process. (3) An MCMC-based cluster sampling algorithm, namely Swendsen-Wang cuts, is adopted to solve the graph partition efficiently. Unlike traditional graph partition techniques, this algorithm is able to explore near-globally optimal solutions and does not require a good initialization. We apply our method to a diverse set of 1600 video shots collected from the Internet as well as a subset of the TRECVID 2010 data, and adopt two benchmark metrics, Purity and Conditional Entropy, to evaluate performance. The experimental results show that our method outperforms other popular state-of-the-art methods.
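The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes the definitions commonly used in the clustering literature: Purity is the fraction of items that carry the majority ground-truth label of their cluster, and Conditional Entropy is H(T | C), the entropy of the true label T given the cluster assignment C. The exact formulas used in the paper may differ. The function names, the base-2 logarithm, and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log2


def _label_counts_per_cluster(clusters, labels):
    """Map each cluster id to a Counter of the ground-truth labels it contains."""
    by_cluster = {}
    for c, t in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    return by_cluster


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster (higher is better)."""
    by_cluster = _label_counts_per_cluster(clusters, labels)
    majority_total = sum(max(cnt.values()) for cnt in by_cluster.values())
    return majority_total / len(labels)


def conditional_entropy(clusters, labels):
    """H(T | C) in bits: remaining label uncertainty once the cluster is known (lower is better)."""
    n = len(labels)
    h = 0.0
    for cnt in _label_counts_per_cluster(clusters, labels).values():
        size = sum(cnt.values())
        for m in cnt.values():
            # p(c, t) = m / n and p(t | c) = m / size
            h -= (m / n) * log2(m / size)
    return h


# Toy example (hypothetical data): six shots, two discovered clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sports", "sports", "news", "news", "news", "news"]
print(purity(clusters, labels))               # 5 of 6 shots match their cluster's majority label
print(conditional_entropy(clusters, labels))  # only cluster 0 is mixed, so it alone contributes
```

A perfect clustering gives a Purity of 1 and a Conditional Entropy of 0. Purity alone can be inflated by producing many small clusters, which is one reason to report the two metrics together.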
