Modeling sequential natural behavior based on visual routines

This research models human behavior as the sequential assembly of primitive actions, and analyzes natural behavior by building computational models in routine space in a situated manner. This thesis presents four contributions. First, we built a real-time computer vision system that can simulate basic human visuomotor behaviors. Its main psychological background is the theory of visual routines, which hypothesizes that the brain contains a collection of built-in programs that are coded with a fixed set of basic visual primitives and can be reprogrammed to carry out various visual tasks. Despite general recognition of the value of such an approach and several pieces of biological evidence supporting it, no detailed, successful model of visual routines has been built. Furthermore, no situated model has been described that acknowledges the specific advantages conferred on such an approach by the human body and its elaborate eye-movement system. Second, we introduced a general framework for task decomposition and modeling. The internal structures of natural tasks are extracted by mining common segments in behavioral routines, taking advantage of interpersonal variations in solving identical tasks. With routine segmentation, a subtask-level decomposition of a task is achieved by abstracting routine segments as macros, which correspond to subtasks. The variations in the execution order of subtasks are then captured by a Markov model, which, in turn, can synthesize new routines to solve the task being modeled. Third, we proposed a computational model for anticipatory fixations. Anticipatory fixations are proactive eye movements to task-relevant objects that are to be used in the future. Although these recently observed fixations convey important information about cognitive planning and visual memory, only a handful of researchers have seriously studied them empirically, few have attempted to explain them, and none have proposed a computational model for them.
Our leaky bucket model relates anticipatory fixations to the decay of visual memory, the demand for gaze, and task planning. Fourth, we built a behavior recognition system for human-object interactions. The task model greatly reduces the ambiguity in Bayesian inference by providing prior information about the planning and timing of task execution. As observed evidence, visually attended objects are recognized by processing video taken by a head-mounted camera and analyzing eye movements in head coordinates. Hand movements and clock readings are also used in the inference, which allows human behavior to be recognized with high precision.
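
As an illustration of the second contribution, the idea of capturing variations in subtask order with a Markov model and then sampling new routines from it can be sketched as follows. This is a minimal sketch under stated assumptions, not the model used in the thesis: it assumes a first-order chain estimated from transition counts, and the subtask labels, the `fit_transitions` and `synthesize` helpers, and the `START`/`END` markers are all hypothetical names introduced for this example.

```python
import random
from collections import defaultdict

# Hypothetical markers for the beginning and end of a routine.
START, END = "<start>", "<end>"


def fit_transitions(routines):
    """Estimate first-order transition probabilities from subtask sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for routine in routines:
        path = [START, *routine, END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        prev: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
        for prev, nxts in counts.items()
    }


def synthesize(transitions, rng, max_steps=50):
    """Sample a subtask sequence by walking the chain from START.

    The walk stops when END is drawn, or after max_steps draws, in which
    case the routine collected so far is returned.
    """
    state, routine = START, []
    for _ in range(max_steps):
        options = transitions[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == END:
            break
        routine.append(state)
    return routine


# Hypothetical subtask labels; three subjects solve the same task in
# slightly different orders (illustrative data, not from the thesis).
observed = [
    ["get_bread", "spread_peanut_butter", "spread_jelly", "close_sandwich"],
    ["get_bread", "spread_jelly", "spread_peanut_butter", "close_sandwich"],
    ["get_bread", "spread_peanut_butter", "spread_jelly", "close_sandwich"],
]
transitions = fit_transitions(observed)
sample = synthesize(transitions, random.Random(0))
```

A first-order chain only remembers the previous subtask, so a sampled routine may repeat a subtask that a human would perform once. A fuller model would need additional state, such as the set of subtasks already completed, to rule that out.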