Transformed hidden Markov models: estimating mixture models of images and inferring spatial transformations in video sequences

In this paper we describe a novel generative model for video analysis called the transformed hidden Markov model (THMM). The video sequence is modeled as a set of frames generated by transforming a small number of class images that summarize the sequence. For each frame, the transformation and the class are discrete latent variables that depend on the previous class and transformation in the sequence. The set of possible transformations is defined in advance, and it can include a variety of transformation such as translation, rotation and shearing. In each stage of such a Markov model, a new frame is generated from a transformed Gaussian distribution based on the class/transformation combination generated by the Markov chain. This model can be viewed as an extension of a transformed mixture of Gaussians through time. We use this model to cluster unlabeled video segments and form a video summary in an unsupervised fashion. We also use the trained models to perform tracking, image stabilization and filtering. We demonstrate that the THMM is capable of combining long term dependencies in video sequences (repeating similar frames in remote parts of the sequence) with short term dependencies (such as short term image frame similarities and motion patterns) to better summarize and process a video sequence even in the presence of high levels of white or structured noise (such as foreground occlusion).

[1]  Brendan J. Frey,et al.  Probabilistic multimedia objects (multijects): a novel approach to video indexing and retrieval in multimedia systems , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[2]  Brendan J. Frey,et al.  Estimating mixture models of images and inferring spatial transformations using the EM algorithm , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[3]  Brendan J. Frey Filling in scenes by propagating probabilities through layers and into appearance models , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[4]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[5]  Nuno Vasconcelos,et al.  Bayesian modeling of video editing and structure: semantic features for video summarization and browsing , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[6]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[7]  Biing-Hwang Juang,et al.  Maximum likelihood estimation for multivariate mixture observations of markov chains , 1986, IEEE Trans. Inf. Theory.