Towards High-Level Human Activity Recognition through Computer Vision and Temporal Logic

Most approaches to the visual perception of humans do not include high-level activity recognition. This paper presents a system that fuses and interprets the outputs of several computer vision components, together with speech recognition, to obtain a high-level understanding of the perceived scene. Our laboratory for investigating new ways of human-machine interaction and teamwork support is equipped with multiple cameras, close-talking microphones, and a video wall as the main interaction device. There, we develop state-of-the-art real-time computer vision systems to track and identify users and to estimate their visual focus of attention and gesture activity. We also monitor the users' speech activity in real time. This paper explains our approach to high-level activity recognition based on these perceptual components and a temporal logic engine.
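To make the interplay between perceptual components and temporal reasoning concrete, the following Python sketch shows how Allen-style interval overlap can lift low-level observations (speech segments, focus-of-attention intervals) into a high-level activity hypothesis. This is a minimal illustration under assumed names: the Event type, the "presentation" rule, and detect_presentation are hypothetical and stand in for the system's actual temporal logic engine.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """A time-stamped output of a perceptual component (hypothetical schema)."""
    label: str      # e.g. "speech" or "gaze_at_wall"
    subject: str    # track ID of the observed person
    start: float    # interval start, in seconds
    end: float      # interval end, in seconds

def overlaps(a: Event, b: Event) -> bool:
    """Allen-style temporal overlap: the two intervals share some time."""
    return a.start < b.end and b.start < a.end

def detect_presentation(events, min_overlap=5.0):
    """Illustrative rule: hypothesize a 'presentation' whenever one person's
    speech overlaps another person's gaze at the video wall for at least
    min_overlap seconds."""
    speech = [e for e in events if e.label == "speech"]
    gaze = [e for e in events if e.label == "gaze_at_wall"]
    hits = []
    for s in speech:
        for g in gaze:
            if g.subject != s.subject and overlaps(s, g):
                shared_start = max(s.start, g.start)
                shared_end = min(s.end, g.end)
                if shared_end - shared_start >= min_overlap:
                    hits.append((s.subject, shared_start, shared_end))
    return hits

# Example: person A speaks while person B watches the video wall.
events = [
    Event("speech", "A", 10.0, 40.0),
    Event("gaze_at_wall", "B", 12.0, 35.0),
]
print(detect_presentation(events))  # [('A', 12.0, 35.0)]
```

In the same spirit, richer interval relations (before, meets, during) and rules over more event types could be combined to cover the other high-level activities the system targets; the sketch above only demonstrates the basic fusion pattern.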
