Escaping local minima through hierarchical model selection: Automatic object discovery, segmentation, and tracking in video

Recently, the generative modeling approach to video segmentation has been gaining popularity in the computer vision community. For example, the flexible sprites framework has been studied in [11,13,14,24], among other references. In general, detailed generative models are vulnerable to intractable inference and, once approximations are made, to local minima (see, e.g., [25]). Recent approaches to these problems have focused on inference techniques for increasingly expressive models. Simpler models, on the other hand, while less precise, are often not only faster but also less prone to local minima. In addition, while many different models may be based on similar hidden variables, some models may be more amenable to inference of some of the shared variables, while other models lead to efficient and accurate inference of other components of the hierarchical data description. In this paper, we empirically illustrate that forcing multiple models to share the posterior distribution leads to inference that is less prone to local minima. We define a set of key hidden variables that describe the aspects of the data we care about. The relationships among these key variables are defined through multiple conditional distribution models on the same pairs of variables, controlled by switch variables. The posterior distribution over the key hidden variables is shared, and inference of the switch variables serves as a mechanism for combinatorial model selection. The key observation is that, while the most expressive model often ends up the winner by the end of the iterative learning of model parameters, early iterations are dominated by simpler model components. Upon convergence, the free energy is lower than that reached when all of the most complex components are switched on from the beginning of learning. We illustrate the performance of this approach on the unsupervised video segmentation task.
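The switch-variable construction described above can be written out as a short variational sketch. The notation below is an assumption made for illustration and is not taken from the paper: x denotes the observed video, h = (h_1, ..., h_N) the key hidden variables, E a set of directed pairs (i, j) assumed to form a tree rooted at h_1, and s_ij a switch that selects one of K_ij candidate conditional models p_k for the pair (i, j). The factorized posterior in the third line is likewise an assumed mean-field form.

```latex
% Illustrative notation only; none of these symbols are taken from the paper.
% x: observed video; h = (h_1, ..., h_N): key hidden variables;
% E: directed pairs (i,j), assumed to form a tree rooted at h_1;
% s_{ij} \in \{1, ..., K_{ij}\}: switch selecting one conditional model for pair (i,j).
\begin{align}
p(x, h, s) &= p(x \mid h)\, p(h_1) \prod_{(i,j) \in E} p(s_{ij})
              \prod_{k=1}^{K_{ij}} p_k(h_j \mid h_i)^{[s_{ij} = k]} \\
F(q) &= \sum_{s} \int q(h, s) \log \frac{q(h, s)}{p(x, h, s)}\, dh
        \;\ge\; -\log p(x) \\
q(h, s) &= q(h) \prod_{(i,j) \in E} q(s_{ij})
           \quad \text{(assumed mean-field factorization)} \\
q(s_{ij} = k) &\propto p(s_{ij} = k)\,
           \exp\!\Big( \mathbb{E}_{q(h)}\big[ \log p_k(h_j \mid h_i) \big] \Big)
\end{align}
```

Under these assumptions, the last line is the standard mean-field update for a discrete switch. It weights each candidate model by how well it explains the current shared posterior q(h), which is one way to read the statement that inference of the switch variables performs model selection.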

[1] Terrence J. Sejnowski, et al. Variational Learning for Switching State-Space Models, 2001.

[2] J. Movellan, et al. Large-Scale Convolutional HMMs for Real-Time Video Tracking, 2003.

[3] Alex Pentland, et al. Pfinder: real-time tracking of the human body, 1996, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition.

[4] David J. Fleet, et al. Robust Online Appearance Models for Visual Tracking, 2003, IEEE Trans. Pattern Anal. Mach. Intell.

[5] Brendan J. Frey, et al. Learning flexible sprites in video layers, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[6] Naonori Ueda, et al. Bayesian model search for mixture models based on optimizing variational bounds, 2002, Neural Networks.

[7] Geoffrey E. Hinton, et al. Adaptive Mixtures of Local Experts, 1991, Neural Computation.

[8] Robert A. Jacobs, et al. Hierarchical Mixtures of Experts and the EM Algorithm, 1993, Neural Computation.

[9] Edward H. Adelson, et al. Ordinal characteristics of transparency, 1990.

[10] Michael I. Jordan, et al. Hierarchical Mixtures of Experts and the EM Algorithm, 1994, Neural Computation.

[11] Andrew Zisserman, et al. Learning Layered Motion Segmentation of Video, 2005, ICCV.

[12] Cristian Sminchisescu, et al. Building Roadmaps of Minima and Transitions in Visual Models, 2004, International Journal of Computer Vision.

[13] Nebojsa Jojic, et al. Consistent segmentation for optical flow estimation, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[14] Edward H. Adelson, et al. Representing moving images with layers, 1994, IEEE Trans. Image Process.

[15] Brendan J. Frey, et al. A comparison of algorithms for inference and learning in probabilistic graphical models, 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] Michael I. Jordan, et al. An Introduction to Variational Methods for Graphical Models, 1999, Machine-mediated learning.

[17] Max Welling, et al. Product of experts, 2007, Scholarpedia.

[18] Andrew Blake, et al. Generative Affine Localisation and Tracking, 2004, NIPS.

[19] Brendan J. Frey, et al. Generative Model for Layers of Appearance and Deformation, 2005, AISTATS.

[20] Geoffrey E. Hinton, et al. Evaluation of Adaptive Mixtures of Competing Experts, 1990, NIPS.

[21] Javier R. Movellan, et al. Real-Time Video Tracking Using Convolution HMMs, 2004.

[22] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, 1998, Learning in Graphical Models.

[23] Christopher K. I. Williams, et al. Greedy Learning of Multiple Objects in Images Using Robust Statistics and Factorial Learning, 2004, Neural Computation.

[24] Nir Friedman, et al. The Bayesian Structural EM Algorithm, 1998, UAI.

[25] Yaron Caspi, et al. Probabilistic Index Maps for Modeling Natural Signals, 2004, UAI.

[26] Brendan J. Frey, et al. Learning appearance and transparency manifolds of occluded objects in layers, 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings.