Visual action search and recognition

Human action analysis is an actively growing field that underpins a wide range of applications, including visual surveillance, entertainment systems, and human-computer interfaces. In this dissertation, we focus on two topics: action search and action recognition. Action search aims to find and localize predefined human actions in a large video database, where the predefined actions are usually represented by short query clips. Action recognition identifies the action type of a new, incoming video using trained action models. The two topics overlap in many aspects and share the same underlying techniques. We address them through three approaches, distinguished by whether human pose is used.

1. When the observed human is very small in the image and the background is cluttered, recovering the human pose is impractical, and researchers usually learn a direct mapping from visual appearance to action type. To represent and search human actions in videos, we propose a five-layer hierarchical space-time model (HSTM) in which invariant and selective features coexist. The model supports searching for human actions ranging from rapid sports movements to subtle facial expressions, and naturally localizes occurrences of the queried action in reference videos.

2. When the observed human is close to the camera, the pose can often be obtained by a generative method, such as tracking, which recovers the pose within an analysis-by-synthesis loop. Any sequence-labeling algorithm, such as a hidden Markov model (HMM), conditional random fields (CRFs), or grammar parsing, can then infer the action type from the estimated poses. In this dissertation, we propose a robust tracking algorithm that dynamically chooses optimal weights to fuse multiple cues; the tracker yields motion trajectories of the objects of interest. We then represent actions with regular grammars and convert action detection and recognition into a regular-expression matching problem (a brief sketch appears at the end of this section). The usefulness of this generative framework is further demonstrated by a real-time application, a shrug detector.

3. The limitation of generative methods is their high computational cost at inference time, especially when the human pose is complex, e.g., when it includes joint angles, so they are often used to recover only simple poses such as the global location (i.e., the motion trajectory) in the 2D image plane. Discriminative methods, by contrast, are much faster at test time once trained. We propose a discriminative pose estimator that recovers human poses from monocular images. The estimator represents images with a bag-of-words model and models the image-to-pose distribution with Bayesian mixtures of experts (BME); a sketch of such a mixture also appears at the end of this section. Our contributions include a local descriptor designed specifically for pose estimation and supervised learning of the visual words. It is natural to feed the output of a discriminative pose estimator into a sequence-labeling algorithm, such as a CRF, for action recognition. However, a powerful discriminative pose estimator requires a large labeled image-to-pose training set, which limits this route to action recognition. We therefore propose a novel model that replaces the observation layer of traditional random fields with a latent pose estimator; during training, the human pose is treated as a latent variable.
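As a minimal sketch of this latent-pose idea, the fragment below jointly trains an image-to-pose layer and a pose-to-action classifier by backpropagating a frame-level action loss through the latent pose. The layer sizes, nonlinearity, random data, and plain gradient descent are illustrative assumptions, not the dissertation's exact formulation (which uses random fields over sequences):

```python
import numpy as np

# Minimal sketch: a latent pose layer replaces the hand-crafted observation
# layer of a frame-level action classifier. All sizes and data below are
# illustrative stand-ins.

rng = np.random.default_rng(0)
D, P, A, N = 100, 12, 4, 500      # descriptor dim, pose dim, actions, frames

X = rng.normal(size=(N, D))       # image descriptors (e.g., bag of words)
y = rng.integers(0, A, size=N)    # frame-level action labels

W_pose = 0.01 * rng.normal(size=(D, P))   # image -> latent pose
W_act = 0.01 * rng.normal(size=(P, A))    # latent pose -> action scores

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(200):
    pose = np.tanh(X @ W_pose)            # latent pose: never observed directly
    prob = softmax(pose @ W_act)          # per-frame action posterior
    grad = prob.copy()                    # cross-entropy gradient w.r.t. scores
    grad[np.arange(N), y] -= 1.0
    grad /= N
    grad_pose = (grad @ W_act.T) * (1.0 - pose ** 2)   # backprop through tanh
    W_act -= lr * (pose.T @ grad)
    W_pose -= lr * (X.T @ grad_pose)
```

The transfer learning described next would then amount to initializing W_pose from a pose estimator trained on existing labeled image-to-pose pairs before this joint training.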
The advantages of this model are twofold. First, it learns to convert high-dimensional observations into more compact and informative representations under the supervision of labeled action data; in other words, it is a trainable feature extractor. Second, it enables transfer learning that fully exploits existing knowledge and data on the image-to-pose relationship. The three approaches suit different scenarios, and their effectiveness is tested on synthetic and real datasets.
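To make the second approach concrete, the sketch below quantizes a 2D motion trajectory into direction symbols and detects an action with an ordinary regular expression. The four-symbol alphabet and the "wave" grammar are illustrative assumptions, not the dissertation's actual grammars:

```python
import re

# Minimal sketch of action detection as regular-expression matching over a
# symbolized motion trajectory. The alphabet and grammar are illustrative.

def symbolize(trajectory):
    """Map a 2D trajectory to per-step direction symbols: L/R/U/D."""
    symbols = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        if abs(dx) >= abs(dy):
            symbols.append('R' if dx >= 0 else 'L')
        else:
            symbols.append('D' if dy >= 0 else 'U')
    return ''.join(symbols)

# A hand "wave" as a regular grammar: at least two cycles of sustained
# rightward then leftward motion.
WAVE = re.compile(r'(R{2,}L{2,}){2,}')

track = [(x, 0) for x in [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 2, 1, 0]]
s = symbolize(track)           # -> 'RRRLLLRRRLLL'
m = WAVE.search(s)             # search() both detects and localizes in time
if m:
    print('wave detected at steps', m.start(), 'to', m.end())
```

Because the grammar is regular, detection and temporal localization both fall out of a single search() call, which is what makes the reduction to regular-expression matching attractive.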
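Similarly, the pose estimator of the third approach can be sketched as a mixture of linear experts over a bag-of-words descriptor. The dimensions and randomly initialized parameters below stand in for values a real system would fit (e.g., with EM) on labeled image-to-pose pairs:

```python
import numpy as np

# Minimal sketch of pose prediction in the style of a mixture of experts:
# a softmax gate weights linear experts that map a bag-of-words descriptor
# x to a pose vector. All parameters here are illustrative stand-ins.

rng = np.random.default_rng(1)
D, P, K = 50, 10, 3              # descriptor dim, pose dim, number of experts

V = rng.normal(size=(K, D))      # gating weights: which expert handles which x
W = rng.normal(size=(K, P, D))   # one linear regressor (expert) per component

def predict_pose(x):
    z = V @ x
    g = np.exp(z - z.max())
    g /= g.sum()                          # gate: P(expert k | x)
    mu = np.einsum('kpd,d->kp', W, x)     # each expert's pose estimate
    return g @ mu                         # posterior-mean pose

x = rng.normal(size=D)                    # a stand-in image descriptor
print(predict_pose(x).shape)              # -> (10,)
```

Because pose given an image is often multimodal (e.g., front-back ambiguities), a real system might report the prediction of the most probable expert, or several hypotheses, rather than the posterior mean shown here.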