Unsupervised Mining of Statistical Temporal Structures in Video

In this chapter we present algorithms for unsupervised mining of structures in video using multi-scale statistical models. Video structures are repetitive segments in a video stream with consistent statistical characteristics. Such structures can often be interpreted in relation to distinctive semantics, particularly in structured domains like sports. While much work in the literature explores the link between the observations and the semantics using supervised learning, we propose unsupervised structure mining algorithms that aim to alleviate the burden of labelling and training, and to provide a scalable way of generalizing video indexing techniques to heterogeneous content collections such as surveillance and consumer video. Existing unsupervised video structuring work primarily uses clustering techniques, while the rich statistical characteristics in the temporal dimension at different granularities remain unexplored. Automatically identifying structures from an unknown domain poses significant challenges when domain knowledge is not explicitly present to assist algorithm design, model selection, and feature selection. In this work we model multi-level statistical structures with hierarchical hidden Markov models based on a multi-level Markov dependency assumption. The parameters of the model are efficiently estimated using the EM algorithm. We have also developed a model structure learning algorithm that uses stochastic sampling techniques to find the optimal model structure, and a feature selection algorithm that automatically finds compact, relevant feature sets using hybrid wrapper-filter methods. When tested on sports videos, the unsupervised learning scheme achieves promising results: (1) The automatically selected feature set for soccer and baseball videos matches sets that are manually selected with domain knowledge. (2) The system automatically discovers high-level structures that match the semantic events in the video. (3) The system achieves better accuracy in detecting semantic events in unlabelled soccer videos than a competing supervised approach designed and trained with domain knowledge.
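
To make the multi-level Markov dependency assumption concrete, the following is a minimal sketch of sampling from a simplified two-level hierarchical hidden Markov model. It is an illustration only and not the chapter's model or estimation procedure: the function name, the exit-probability parameterization, the uniform initial top-level distribution, the one-dimensional Gaussian emissions, and all numeric values are assumptions made for this example.

```python
import random

def sample_two_level_hhmm(T, top_trans, child_init, child_trans,
                          exit_prob, means, sigma=1.0, seed=0):
    """Sample T observations from a simplified two-level HHMM.

    All parameter names and shapes are hypothetical:
      top_trans[q][q2]      : top-level transition probabilities
      child_init[q][s]      : initial child-state distribution under q
      child_trans[q][s][s2] : child-level transitions under q
      exit_prob[q][s]       : probability that child state s returns
                              control to the top level
      means[q][s]           : mean of the Gaussian emission
    """
    rng = random.Random(seed)

    def draw(probs):
        # Draw an index according to a discrete distribution.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    # Assumption: uniform initial distribution over top-level states.
    q = draw([1.0 / len(top_trans)] * len(top_trans))
    s = draw(child_init[q])

    top_path, child_path, obs = [], [], []
    for _ in range(T):
        top_path.append(q)
        child_path.append(s)
        obs.append(rng.gauss(means[q][s], sigma))
        if rng.random() < exit_prob[q][s]:
            # The child chain exits: the top level makes a transition
            # and a fresh child chain starts under the new top state.
            q = draw(top_trans[q])
            s = draw(child_init[q])
        else:
            # Otherwise only the child level moves.
            s = draw(child_trans[q][s])
    return top_path, child_path, obs

# Hypothetical parameters: two top-level states, two child states each.
TOP_TRANS = [[0.2, 0.8], [0.7, 0.3]]
CHILD_INIT = [[0.9, 0.1], [0.5, 0.5]]
CHILD_TRANS = [[[0.8, 0.2], [0.3, 0.7]],
               [[0.6, 0.4], [0.4, 0.6]]]
EXIT_PROB = [[0.05, 0.10], [0.10, 0.05]]
MEANS = [[0.0, 1.0], [5.0, 6.0]]

top_path, child_path, obs = sample_two_level_hhmm(
    50, TOP_TRANS, CHILD_INIT, CHILD_TRANS, EXIT_PROB, MEANS)
```

In this sketch the top-level path changes only when a child chain exits, so it evolves on a coarser time scale than the child path. That is the sense in which such a model can capture temporal structure at more than one granularity.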
