Joint Design of Data Analysis Algorithms and User Interfaces for Video Applications

The graphical modeling paradigm provides a way of representing data through hidden causes of variability that can be estimated from the data in an unsupervised manner. Recently, a lot of research has been dedicated to finding efficient inference and learning engines for graphical models in general, as well as to finding various ways of using graphical models to perform recognition, classification, segmentation, and tracking tasks in video applications. Little research, however, has focused on another advantage of a graphical model: by discovering the structural elements in the data, it renders the data much easier to browse, manipulate, or interact with. In this paper, we present several ideas on how the user interface and the data analysis tools can be designed jointly, starting from an appropriate data representation scheme and a generative model based on it. We base our approach on three basic principles:

• Compatibility of the graphical model's structure with our own perception of the world
• Simplicity in representation, leading to more efficient inference
• Intuitive interactivity on the level of hidden causes of variability

1 The graphical model should reflect the structure of the world

We think about the world in terms of scenes, objects, and motion. While these are often primarily visual objects, we associate with them other sensory stimuli, such as sounds and smells. The basic components of a scene sometimes interact in a structured way to form a distinct activity, or even a story about the objects. In our memory of events, the time axis is warped, sometimes stretched but most of the time compressed, and the key elements are the objects and their relationships. At the most basic level, our distant memories are mostly of scenes and short activities that give a sense of our past experience, rather than detailed and precise accounts of what really happened.
Video is a recording of the world which, even though it can contain very rich visual and auditory information, is provided in a completely different form: a set of pixels, waveforms, and possibly short-term motion vectors that are part of the compression scheme. In order to have an intuitive interface to this data, we have to transform it into a form closer to what we store in our memory. The graphical modeling approach to data analysis is compatible with this goal. We can easily describe the data formation process as combining multiple objects and a scene background to form a video frame. Such a model would have, as hidden variables for each frame, the positions and orientations of the objects, the ordering of objects that defines the occlusions in the scene, and the illumination characteristics of the scene. At a higher level, additional hidden variables could control how these variables change through time, thus capturing the motion or illumination patterns. As parameters that apply to all frames, we can have descriptors of all objects that appear in the sequence, meta-properties of illumination sources, priors on motion patterns, etc. In principle, given lots of data, e.g. an hour of vacation video, completely unsupervised parameter estimation (learning) would result in a summary of all objects, likely motion patterns, etc., while the inference result for each frame would consist of information on the presence/absence, position, and orientation of each object, the brightness of the frame, etc. A number of inference strategies have been developed in the machine learning community to help with this task.

2 The model should be simple enough to reduce tractability issues in inference

Given the state of computer graphics today, we may be tempted to use a very detailed, almost perfect model of the world, with hidden variables that interact in a very nonlinear manner.
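As a concrete sketch of the model structure described above, the per-frame hidden variables and the sequence-wide parameters might be organized as follows (the class and field names are illustrative, not from the model definition itself):

```python
from dataclasses import dataclass

@dataclass
class FrameState:
    """Hidden variables inferred for a single video frame."""
    present: dict       # object id -> posterior probability of presence
    position: dict      # object id -> (x, y) translation in the frame
    orientation: dict   # object id -> rotation angle
    depth_order: list   # front-to-back ordering, defines occlusions
    brightness: float   # global illumination of the frame

@dataclass
class SequenceModel:
    """Parameters shared across all frames, estimated by learning."""
    appearances: dict   # object id -> mean appearance bitmap
    shapes: dict        # object id -> transparency (mask) map
    motion_priors: dict # object id -> prior over motion patterns
```

Unsupervised learning would fill in a `SequenceModel` from the whole video, while inference would produce one `FrameState` per frame.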
For instance, we may be tempted to model the world in terms of the full 3-D structure of each object, and to generate each frame by ray tracing. There have even been some examples of amazingly successful fitting of detailed 3-D models to image data in specialized applications, e.g. [1]. However, in order to start with only the model structure, and no other prior knowledge of the shape, size, and position of the objects, it is important to simplify the model as much as possible to allow for robust unsupervised learning. Thus, representations that keep the basic notion of the world's structure and for which efficient inference and learning techniques can be developed are preferable. For instance, in Fig. 1, we illustrate a graphical model that treats objects as bitmaps with per-pixel noise, combined according to shapes defined by transparency maps [2]. The transformation variable can take on a finite number of values, each defining a different translation in the image. The objects are generated in a number of layers, and the final image is the composition of these layers based on the transparency maps. In addition to the number of discrete variables that makes the number of configurations large, the only other nonlinearity in the model is the sprite composition equation,

x = T_L m_L ∗ T_L s_L + (1 − T_L m_L) ∗ (T_{L−1} m_{L−1} ∗ T_{L−1} s_{L−1} + (1 − T_{L−1} m_{L−1}) ∗ (T_{L−2} m_{L−2} ∗ T_{L−2} s_{L−2} + (1 − T_{L−2} m_{L−2}) ∗ · · · (T_1 m_1 ∗ T_1 s_1 + (1 − T_1 m_1) ∗ s_0))) + noise,   (1)

where T_l is the transformation applied to layer l, s_l is its sprite appearance, m_l is its transparency map, s_0 is the background, and ∗ denotes the element-wise product. We have shown that, due to the particular way they interact with the sprite appearances, the translation hidden variables can be dealt with efficiently in the FFT domain, in both the E and the M step of the learning algorithm [3]. The product in (1) still renders exact inference intractable, but it turns out that variational inference that treats the posterior as Gaussian and decoupled makes the process tractable, and yet not too far from the expected result of exact inference.
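Equation (1) amounts to back-to-front alpha compositing of transformed sprites over the background. A minimal NumPy sketch, under the simplifying assumption that each transformation T_l is an integer translation (implemented here with a circular shift) and ignoring the noise term:

```python
import numpy as np

def transform(img, t):
    """Apply a translation t = (dy, dx) as a circular shift -- a
    stand-in for the discrete transformation variable T_l."""
    return np.roll(img, shift=t, axis=(0, 1))

def compose_layers(s0, sprites, masks, translations):
    """Compose layers back to front, as in Eq. (1):
    x = T_L m_L * T_L s_L + (1 - T_L m_L) * (composition of layers below),
    starting from the background s0."""
    x = s0
    for s, m, t in zip(sprites, masks, translations):  # layers 1 .. L
        Tm, Ts = transform(m, t), transform(s, t)
        x = Tm * Ts + (1.0 - Tm) * x
    return x
```

With a fully opaque mask (m = 1 everywhere) the transformed sprite replaces the background, and with m = 0 the background shows through unchanged, matching the role of the transparency maps in the model.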
3 The interaction should be based on visual manipulation of the inferred hidden variables

Having defined the model structure, we can start with an empty model and a video sequence, and use an appropriate learning algorithm to fill in the blanks: the parameters defining the average object appearances and shapes in the sequence, and the hidden causes of variability in each frame, e.g., the transformation variables and the current segmentation of the objects, defined by the posterior distributions p(T_{i,t}|x_t) and p(s_{i,t}|x_t). The interaction with the data can then be done directly by interacting with the inferred parameters and hidden variables. For instance, if we are interested in all parts of the video in which a certain object was visible, we can click on the learned bitmap representing this object at the top of the screen. If we want to edit the sequence, changing the motion parameters, for example, we can directly edit the transformation variables and regenerate the sequence. We can also easily remove or insert objects via drag-and-drop interactions. This high level of interactivity is enabled by a bidirectional mapping between the hidden causes of variability in the data and the pixels in the video. For each pixel in the video, the posterior distribution over transformations and segmentation masks gives us information about which class of object it belongs to.
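Such interactions reduce to simple queries and edits on the inferred quantities. A hypothetical sketch, where `presence[i][t]` stands for the posterior probability that object i is visible in frame t (the function names and thresholds are illustrative, not part of the model):

```python
def frames_with_object(presence, obj_id, threshold=0.5):
    """Answer a 'click on the object bitmap' query: return the indices
    of frames in which the posterior probability that the object is
    visible exceeds the threshold."""
    return [t for t, p in enumerate(presence[obj_id]) if p > threshold]

def remove_object(frame_state, obj_id):
    """Answer a 'drag object out of the scene' edit: drop the object
    from a frame's hidden state (here a dict of object id -> hidden
    variables); regenerating the frame from the model would then
    render it without the object."""
    edited = dict(frame_state)
    edited.pop(obj_id, None)
    return edited
```

The point of the sketch is that neither operation touches pixels directly; both act on the hidden causes of variability, and the generative model maps the edits back to pixels.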

[1] T. Vetter et al., "A Morphable Model for the Synthesis of 3D Faces," SIGGRAPH, 1999.

[2] B. J. Frey et al., "Fast, Large-Scale Transformation-Invariant Clustering," NIPS, 2001.

[3] B. J. Frey et al., "Real-time On-line Learning of Transformed Hidden Markov Models from Video," AISTATS, 2003.

[4] B. J. Frey et al., "Learning Flexible Sprites in Video Layers," Proc. IEEE CVPR, 2001.