Event detection and recognition for semantic annotation of video

Research on methods for the detection and recognition of events and actions in videos is receiving increasing attention from the scientific community because of its relevance to many applications, from semantic video indexing to intelligent video surveillance systems and advanced human-computer interaction interfaces. Event detection and recognition require considering the temporal aspect of video, either at a low level with appropriate features or at a higher level with models and classifiers that can represent time. In this paper we survey the field of event recognition, from interest point detectors and descriptors to event modelling techniques and knowledge management technologies. We provide an overview of the methods, categorising them according to video production methods and video domains, and according to the types of events and actions that are typical of these domains.
