Multi-view geometric constraints for human action recognition and tracking

Human actions are the essence of a human life and a natural product of the human mind. The automatic analysis of human activities has attracted the attention of many researchers and is important in a variety of domains, including surveillance, video retrieval, human-computer interaction, and athlete performance analysis. This dissertation makes three major contributions to the automatic analysis of human actions. First, we conjecture that the relationship between the body joints of two actors in the same posture can be described by a 3D rigid transformation. This transformation simultaneously captures different poses and various sizes and proportions. As a consequence of this conjecture, we show that there exists a fundamental matrix between the imaged positions of the body joints of two actors when they are in the same posture. Second, we propose a novel projection model for cameras moving at a constant velocity in 3D space, which we call Galilean cameras, derive the Galilean fundamental matrix, and apply it to human action recognition. Third, we propose a novel use of the invariance of the ratio of areas under an affine transformation, together with the epipolar geometry between two cameras, for 2D model-based tracking of human body joints.

In the first part of the thesis, we propose an approach to match human actions using semantic correspondences between human bodies. These correspondences provide geometric constraints between multiple anatomical landmarks (e.g. hands, shoulders, and feet), which are used to match actions observed from different viewpoints and performed at different rates by actors of differing anthropometric proportions. The fact that human bodies share approximately the same anthropometric proportions allows an innovative use of the machinery of epipolar geometry to provide constraints for analyzing actions performed by people of different anthropometric sizes, while ensuring that changes in viewpoint do not affect matching.
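As an illustration of the fundamental-matrix claim, the sketch below stacks the epipolar constraint for thirteen joint correspondences into a 13 x 9 matrix and inspects its smallest singular value. It is a minimal sketch under stated assumptions, not the measure defined in the thesis: the synthetic 3D joints, the identity camera intrinsics, the camera pose, and the names `project`, `epipolar_design_matrix`, and `rank_deficiency_score` are all hypothetical. Both views here observe the same 3D points; a rigidly transformed copy of the posture would behave identically, because the rigid transformation can be absorbed into the pose of the second camera.

```python
import numpy as np

def project(points_3d, R, t):
    """Pinhole projection with identity intrinsics (an assumption of this sketch)."""
    cam = points_3d @ R.T + t
    return cam[:, :2] / cam[:, 2:3]

def epipolar_design_matrix(pts_a, pts_b):
    """One row per correspondence, so that A @ F.ravel() = 0 when x_b^T F x_a = 0."""
    xa, ya = pts_a[:, 0], pts_a[:, 1]
    xb, yb = pts_b[:, 0], pts_b[:, 1]
    ones = np.ones_like(xa)
    return np.column_stack(
        [xb * xa, xb * ya, xb, yb * xa, yb * ya, yb, xa, ya, ones]
    )

def rank_deficiency_score(pts_a, pts_b):
    """Smallest over largest singular value; near zero when a fundamental matrix fits."""
    singular_values = np.linalg.svd(
        epipolar_design_matrix(pts_a, pts_b), compute_uv=False
    )
    return singular_values[-1] / singular_values[0]

rng = np.random.default_rng(0)
depth_offset = np.array([0.0, 0.0, 5.0])
# Thirteen synthetic 3D "joints" and an unrelated second posture.
joints = rng.uniform(-1.0, 1.0, size=(13, 3)) + depth_offset
other_posture = rng.uniform(-1.0, 1.0, size=(13, 3)) + depth_offset

# Hypothetical pose of the second camera: rotation about the y axis plus a translation.
angle = 0.3
R = np.array([
    [np.cos(angle), 0.0, np.sin(angle)],
    [0.0, 1.0, 0.0],
    [-np.sin(angle), 0.0, np.cos(angle)],
])
t = np.array([-1.5, 0.1, 0.5])

view_a = project(joints, np.eye(3), np.zeros(3))
score_same = rank_deficiency_score(view_a, project(joints, R, t))
score_diff = rank_deficiency_score(view_a, project(other_posture, R, t))
print(score_same, score_diff)
```

For matching postures the stacked matrix has rank at most eight, so the score is at the level of floating-point round-off; for unrelated postures it is generally much larger. With real image measurements, the coordinates would normally be normalized before such a score is computed.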
A novel measure, defined in terms of the rank of a matrix constructed only from image measurements of the locations of anatomical landmarks, is proposed to ensure that similar actions are accurately recognized. We also describe how dynamic time warping can be used in conjunction with the proposed measure to match actions in the presence of nonlinear time warps. We demonstrate the versatility of our algorithm on a number of challenging sequences and applications, including action synchronization, odd one out, following the leader, and periodicity analysis.

Next, we extend the conventional model of image projection to video captured by a camera moving at constant velocity. We term such a moving camera a Galilean camera. To that end, we derive the spacetime projection and develop the corresponding epipolar geometry between two Galilean cameras. Both perspective imaging and linear pushbroom imaging are specializations of the proposed model, and we show how six different “fundamental” matrices, including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs), are related and can be directly recovered from a Galilean fundamental matrix. We provide linear algorithms for estimating the parameters of the mapping between videos in the case of planar scenes. To apply the fundamental matrix between Galilean cameras to human action recognition, we propose a measure with two important properties. The first makes it possible to recognize similar actions when their execution rates are linearly related. The second allows actions to be recognized in video captured by Galilean cameras. Thus, the proposed algorithm guarantees that actions can be correctly matched despite changes in view, execution rate, and anthropometric proportions of the actor, even when the camera moves with constant velocity.

Finally, we propose a novel 2D model-based approach for tracking human body parts during articulated motion.
The human body is modeled as a 2D stick figure of thirteen body joints, and an action is treated as a sequence of these stick figures. Given the locations of these joints in every frame of a model video and in the first frame of a test video, the joint locations are automatically estimated throughout the test video using two geometric constraints. First, the invariance of the ratio of areas under an affine transformation is used to obtain initial estimates of the joint locations in the test video. Second, the epipolar geometry between the two cameras is used to refine these estimates. Starting from these estimated joint locations, the tracking algorithm determines the exact location of each landmark in the test video from the foreground silhouettes. The novelty of the proposed approach lies in the geometric formulation of human action models, the combination of the two geometric constraints for body joint prediction, and the handling of variations in the anthropometry of individuals, viewpoint, execution rate, and style of performing an action. The proposed approach does not require extensive training and can easily adapt to a wide variety of articulated actions.
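The ratio-of-areas invariant used for the initial estimate can be illustrated with a short numerical sketch. An affine map scales every area by the same factor, so ratios of triangle areas are unchanged. The example below shows one way such ratios could predict a joint location from three reference joints; it is an assumption for illustration, not necessarily the procedure followed in the thesis, and the function names, joint coordinates, and affine map are all hypothetical.

```python
import numpy as np

def signed_area(p, q, r):
    """Signed area of the triangle (p, q, r)."""
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))

def area_ratios(triangle, point):
    """Ratios of the three sub-triangle areas to the full triangle area."""
    a, b, c = triangle
    total = signed_area(a, b, c)
    parts = np.array([
        signed_area(point, b, c),
        signed_area(a, point, c),
        signed_area(a, b, point),
    ])
    return parts / total

def predict_from_ratios(triangle, ratios):
    """Recover a point from its area ratios relative to a triangle."""
    return ratios @ triangle

# Three reference joints and a fourth joint in a model frame (made-up values).
model_triangle = np.array([[0.0, 0.0], [4.0, 0.0], [1.0, 3.0]])
model_joint = np.array([2.5, 2.0])

# Hypothetical affine map relating the model frame to the test frame.
M = np.array([[1.2, 0.3], [-0.2, 0.8]])
offset = np.array([10.0, -5.0])

test_triangle = model_triangle @ M.T + offset
true_test_joint = M @ model_joint + offset

ratios = area_ratios(model_triangle, model_joint)
predicted_joint = predict_from_ratios(test_triangle, ratios)
print(predicted_joint, true_test_joint)
```

Under an exact affine relation the prediction coincides with the true location. In practice the relation holds only approximately, which is consistent with the abstract's use of the epipolar geometry and foreground silhouettes to refine the initial estimates.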
