Event Mining in Multimedia Streams

Events are real-world occurrences that unfold over space and time. Event mining from multimedia streams improves the access and reuse of large media collections, and it has been an active area of research with notable progress. This paper contains a survey on the problems and solutions in event mining, approached from three aspects: event description, event-modeling components, and current event mining systems. We present a general characterization of multimedia events, motivated by the maxim of five ldquoWrdquos and one ldquoHrdquo for reporting real-world events in journalism: when, where, who, what, why, and how. We discuss the causes for semantic variability in real-world descriptions, including multilevel event semantics, implicit semantics facets, and the influence of context. We discuss five main aspects of an event detection system. These aspects are: the variants of tasks and event definitions that constrain system design, the media capture setup that collectively define the available data and necessary domain assumptions, the feature extraction step that converts the captured data into perceptually significant numeric or symbolic forms, statistical models that map the feature representations to richer semantic descriptions, and applications that use event metadata to help in different information-seeking tasks. We review current event-mining systems in detail, grouping them by the problem formulations and approaches. The review includes detection of events and actions in one or more continuous sequences, events in edited video streams, unsupervised event discovery, events in a collection of media objects, and a discussion on ongoing benchmark activities. These problems span a wide range of multimedia domains such as surveillance, meetings, broadcast news, sports, documentary, and films, as well as personal and online media collections. We conclude this survey with a brief outlook on open research directions.

[1]  Alex Pentland,et al.  Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Shih-Fu Chang,et al.  Image Retrieval: Current Techniques, Promising Directions, and Open Issues , 1999, J. Vis. Commun. Image Represent..

[3]  Mor Naaman,et al.  Why we tag: motivations for annotation in mobile and online media , 2007, CHI.

[4]  Saul Greenberg,et al.  Context as a Dynamic Construct , 2001, Hum. Comput. Interact..

[5]  Ramakant Nevatia,et al.  Representation and optimal recognition of human activities , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[6]  Abigail Sellen,et al.  Understanding videowork , 2007, CHI.

[7]  Mubarak Shah,et al.  View-Invariant Representation and Recognition of Actions , 2002, International Journal of Computer Vision.

[8]  Roland T. Chin,et al.  On image analysis by the methods of moments , 1988, Proceedings CVPR '88: The Computer Society Conference on Computer Vision and Pattern Recognition.

[9]  James H. Martin,et al.  Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition , 2000, Prentice Hall series in artificial intelligence.

[10]  Marcel Worring,et al.  Multimodal Video Indexing : A Review of the State-ofthe-art , 2001 .

[11]  Andreas Paepcke,et al.  Time as essence for photo browsing through personal digital libraries , 2002, JCDL '02.

[12]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[13]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[14]  Qi Tian,et al.  A mid-level representation framework for semantic sports video analysis , 2003, ACM Multimedia.

[15]  Patrick Brézillon,et al.  About Some Relationships between Knowledge and Context , 2001, CONTEXT.

[16]  Deb Roy,et al.  Situated Models of Meaning for Sports Video Retrieval , 2007, NAACL.

[17]  Andreas E. Savakis,et al.  Automated event clustering and quality screening of consumer pictures for digital albuming , 2003, IEEE Trans. Multim..

[18]  Roni Rosenfeld,et al.  A whole sentence maximum entropy language model , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[19]  C. Machy,et al.  Performance Evaluation of Frequent Events Detection Systems , 2006 .

[20]  Ramesh Jain,et al.  Toward a Common Event Model for Multimedia Applications , 2007, IEEE MultiMedia.

[21]  Datong Chen,et al.  Multimodal detection of human interaction events in a nursing home environment , 2004, ICMI '04.

[22]  Shih-Fu Chang,et al.  Overview of the MPEG-7 standard , 2001, IEEE Trans. Circuits Syst. Video Technol..

[23]  David C. Minnen,et al.  Propagation networks for recognition of partially ordered sequential action , 2004, CVPR 2004.

[24]  Michael J. Witbrock,et al.  Story segmentation and detection of commercials in broadcast news video , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[25]  Malcolm Slaney,et al.  Construction and evaluation of a robust multifeature speech/music discriminator , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[26]  R. Brunelli,et al.  A Survey on the Automatic Indexing of Video Data, , 1999, J. Vis. Commun. Image Represent..

[27]  Tsuhan Chen,et al.  Audio Feature Extraction and Analysis for Scene Segmentation and Classification , 1998, J. VLSI Signal Process..

[28]  Rama Chellappa,et al.  Attribute Grammar-Based Event Recognition and Anomaly Detection , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[29]  Shih-Fu Chang,et al.  Determining computable scenes in films and their structures using audio-visual memory models , 2000, ACM Multimedia.

[30]  Rong Yan,et al.  A review of text and image retrieval approaches for broadcast news video , 2007, Information Retrieval.

[31]  Wei-Yun Yau,et al.  An automatic drowning detection surveillance system for challenging outdoor pool environments , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[32]  Mor Naaman,et al.  Context data in geo-referenced digital photo collections , 2004, MULTIMEDIA '04.

[33]  Lihi Zelnik-Manor,et al.  Event-based analysis of video , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[34]  Lexing Xie,et al.  Contextual wisdom: social relations and correlations for multimedia event annotation , 2007, ACM Multimedia.

[35]  Frederick Jelinek,et al.  Statistical methods for speech recognition , 1997 .

[36]  Jun Yang,et al.  Annotating News Video with Locations , 2006, CIVR.

[37]  Shih-Fu Chang,et al.  Pattern Mining in Visual Concept Streams , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[38]  David C. Gibbon,et al.  A Fast, Comprehensive Shot Boundary Determination System , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[39]  Ramakant Nevatia,et al.  VERL: An Ontology Framework for Representing and Annotating Video Events , 2005, IEEE Multim..

[40]  Jing Huang,et al.  Spatial Color Indexing and Applications , 2004, International Journal of Computer Vision.

[41]  Rama Chellappa,et al.  From Videos to Verbs: Mining Videos for Activities using a Cascade of Dynamical Systems , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Gregory D. Abowd,et al.  Towards a Better Understanding of Context and Context-Awareness , 1999, HUC.

[43]  Ramesh C. Jain,et al.  Recursive identification of gesture inputs using hidden Markov models , 1994, Proceedings of 1994 IEEE Workshop on Applications of Computer Vision.

[44]  Hari Sundaram,et al.  Modeling user context with applications to media retrieval , 2006, Multimedia Systems.

[45]  Aaas News,et al.  Book Reviews , 1893, Buffalo Medical and Surgical Journal.

[46]  Rainer Stiefelhagen,et al.  Computers in the Human Interaction Loop , 2009, Human-Computer Interaction Series.

[47]  Michael J. Swain,et al.  Color indexing , 1991, International Journal of Computer Vision.

[48]  Marcel Worring,et al.  Interactive Adaptive Movie Annotation , 2003, IEEE Multim..

[49]  Lisa M. Brown,et al.  Specifying, Interpreting and Detecting High-level, Spatio-Temporal Composite Events in Single and Multi-Camera Systems , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[50]  Chin-Hui Lee,et al.  A Multi-Modal Approach to Story Segmentation for News Video , 2003, World Wide Web.

[51]  Mor Naaman,et al.  How flickr helps us make sense of the world: context and content in community-contributed media collections , 2007, ACM Multimedia.

[52]  Alex Pentland,et al.  Unsupervised clustering of ambulatory audio and video , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[53]  Yan Huang,et al.  Propagation networks for recognition of partially ordered sequential action , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[54]  Shih-Fu Chang,et al.  Unsupervised Mining of Statistical Temporal Structures in Video , 2003 .

[55]  Mary Czerwinski,et al.  PhotoTOC: automatic clustering for browsing personal photographs , 2003, Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint.

[56]  Hideki Kawahara,et al.  YIN, a fundamental frequency estimator for speech and music. , 2002, The Journal of the Acoustical Society of America.

[57]  A. Murat Tekalp,et al.  Automatic Soccer Video Analysis and Summarization , 2003, IS&T/SPIE Electronic Imaging.

[58]  Marc Davis,et al.  Editing out Video Editing , 2003, IEEE Multim..

[59]  John S. Boreczky,et al.  Comparison of video shot boundary detection techniques , 1996, J. Electronic Imaging.

[60]  Graham Coleman,et al.  Detection and explanation of anomalous activities: representing activities as bags of event n-grams , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[61]  Marcel Worring,et al.  Content-Based Image Retrieval at the End of the Early Years , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[62]  Brendan J. Frey,et al.  Probabilistic multimedia objects (multijects): a novel approach to video indexing and retrieval in multimedia systems , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[63]  Alex Pentland,et al.  A Bayesian Computer Vision System for Modeling Human Interactions , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[64]  Ullas Gargi,et al.  Modeling and clustering of photo capture streams , 2003, MIR '03.

[65]  Shih-Fu Chang,et al.  Story boundary detection in large broadcast news video archives: techniques, experience and trends , 2004, MULTIMEDIA '04.

[66]  Regunathan Radhakrishnan,et al.  A time series clustering based framework for multimedia mining and summarization using audio features , 2004, MIR '04.

[67]  Andrew Zisserman,et al.  Automatic face recognition for film character retrieval in feature-length films , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[68]  Witold Pedrycz,et al.  Face recognition: A study in information fusion using fuzzy integral , 2005, Pattern Recognit. Lett..

[69]  Linda G. Shapiro,et al.  Computer Vision , 2001 .

[70]  Stefanie Tellex,et al.  The Human Speechome Project , 2006, EELC.

[71]  David Haussler,et al.  Using the Fisher Kernel Method to Detect Remote Protein Homologies , 1999, ISMB.

[72]  B. Li,et al.  Event detection and summarization in sports video , 2001, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001).

[73]  Dong Xu,et al.  Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[74]  Mor Naaman,et al.  Leveraging context to resolve identity in photo albums , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[75]  Atreyi Kankanhalli,et al.  Automatic partitioning of full-motion video , 1993, Multimedia Systems.

[76]  L. Wood,et al.  From the Authors , 2003, European Respiratory Journal.

[77]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[78]  Z. M. Hefed Object tracking , 1999 .

[79]  Wayne H. Wolf,et al.  Smart Cameras as Embedded Systems , 2002, Computer.

[80]  Alex Pentland,et al.  Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[81]  Qi Tian,et al.  Trajectory-based ball detection and tracking with applications to semantic analysis of broadcast soccer video , 2003, MULTIMEDIA '03.

[82]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[83]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[84]  Ramakant Nevatia,et al.  Hierarchical Language-based Representation of Events in Video Streams , 2003, 2003 Conference on Computer Vision and Pattern Recognition Workshop.

[85]  Terry Winograd,et al.  Architectures for Context , 2001, Hum. Comput. Interact..

[86]  Kevin Barraclough,et al.  I and i , 2001, BMJ : British Medical Journal.

[87]  Alberto Del Bimbo,et al.  Diversity in multimedia information retrieval research , 2006, MIR '06.

[88]  Bo Zhang,et al.  A Formal Study of Shot Boundary Detection , 2007, IEEE Transactions on Circuits and Systems for Video Technology.

[89]  Jin Zhao,et al.  Video Retrieval Using High Level Features: Exploiting Query Matching and Confidence-Based Weighting , 2006, CIVR.

[90]  Ramakant Nevatia,et al.  Multi-agent event recognition , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[91]  Anoop Gupta,et al.  Automatically extracting highlights for TV Baseball programs , 2000, ACM Multimedia.

[92]  Erik Hjelmås,et al.  Face Detection: A Survey , 2001, Comput. Vis. Image Underst..

[93]  Abigail Sellen,et al.  Understanding photowork , 2006, CHI.

[94]  David J. Kriegman,et al.  Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection , 1996, ECCV.

[95]  Nicu Sebe,et al.  Content-based multimedia information retrieval: State of the art and challenges , 2006, TOMCCAP.

[96]  Gary C. Borchardt,et al.  Event Calculus , 1985, IJCAI.

[97]  M. Alex O. Vasilescu,et al.  Recognizing action events from multiple viewpoints , 2001, Proceedings IEEE Workshop on Detection and Recognition of Events in Video.

[98]  Shih-Fu Chang,et al.  The holy grail of content-based media analysis , 2002 .

[99]  Murray Shanahan,et al.  The Event Calculus Explained , 1999, Artificial Intelligence Today.

[100]  Joo-Hwee Lim,et al.  Home Photo Content Modeling for Personalized Event-Based Retrieval , 2003, IEEE Multim..

[101]  Frank Nack Capture and transfer of metadata during video production , 2005, MHC '05.

[102]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[103]  Hanning Zhou,et al.  Unusual Event Detection via Multi-camera Video Mining , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[104]  Deb Roy,et al.  Mining temporal patterns of movement for video content classification , 2006, MIR '06.

[105]  Rama Chellappa,et al.  View invariants for human action recognition , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[106]  Douglas A. Reynolds,et al.  Robust text-independent speaker identification using Gaussian mixture speaker models , 1995, IEEE Trans. Speech Audio Process..

[107]  Larry S. Davis,et al.  Objects in Action: An Approach for Combining Action Understanding and Object Perception , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[108]  David Bordwell,et al.  Film Art: An Introduction , 1979 .

[109]  Shih-Fu Chang,et al.  Discovery and fusion of salient multimodal features toward news story segmentation , 2003, IS&T/SPIE Electronic Imaging.

[110]  Valery A. Petrushin Mining rare and frequent events in multi-camera surveillance video using self-organizing maps , 2005, KDD '05.

[111]  Ramakant Nevatia,et al.  Event Detection and Analysis from Video Streams , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[112]  Mor Naaman,et al.  Towards automatic extraction of event and place semantics from flickr tags , 2007, SIGIR.

[113]  Dong Wang,et al.  Video search in concept subspace: a text-like paradigm , 2007, CIVR '07.

[114]  Rong Yan,et al.  Extracting Semantics from Multimedia Content: Challenges and Solutions , 2008 .

[115]  Y. J. Tejwani,et al.  Robot vision , 1989, IEEE International Symposium on Circuits and Systems,.

[116]  Samy Bengio,et al.  Automatic analysis of multimodal group actions in meetings , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[117]  Shahram Ebadollahi,et al.  Visual Event Detection using Multi-Dimensional Concept Dynamics , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[118]  Mor Naaman,et al.  World explorer: visualizing aggregate data from unstructured text in geo-referenced collections , 2007, JCDL '07.

[119]  Xindong Wu,et al.  Video data mining: semantic indexing and event detection from the association perspective , 2005, IEEE Transactions on Knowledge and Data Engineering.

[120]  Shih-Fu Chang,et al.  Structure analysis of soccer video with hidden Markov models , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[121]  John R. Smith,et al.  Large-scale concept ontology for multimedia , 2006, IEEE MultiMedia.

[122]  William D Marslen-Wilson,et al.  Processing interactions and lexical access during word recognition in continuous speech , 1978, Cognitive Psychology.

[123]  Tony Jebara,et al.  Machine learning: Discriminative and generative , 2006 .

[124]  Jianbo Shi,et al.  Detecting unusual activity in video , 2004, CVPR 2004.

[125]  Rachel Bowers,et al.  PETS vs . VACE Evaluation Programs : A Comparative Study , 2006 .

[126]  David J. Kriegman,et al.  Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection , 1996, ECCV.

[127]  Roland T. Chin,et al.  On Image Analysis by the Methods of Moments , 1988, IEEE Trans. Pattern Anal. Mach. Intell..

[128]  Azriel Rosenfeld,et al.  Face recognition: A literature survey , 2003, CSUR.

[129]  Andreas Girgensohn,et al.  Temporal event clustering for digital photo collections , 2003, ACM Multimedia.

[130]  John R. Kender,et al.  Video scene segmentation via continuous video coherence , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[131]  Marcel Worring,et al.  Adding Semantics to Detectors for Video Retrieval , 2007, IEEE Transactions on Multimedia.

[132]  Donna K. Harman,et al.  The Text REtrieval Conference (TREC) , 1999, NTCIR.

[133]  James W. Davis,et al.  The Representation and Recognition of Action Using Temporal Templates , 1997, CVPR 1997.

[134]  Paul Dourish,et al.  What we talk about when we talk about context , 2004, Personal and Ubiquitous Computing.

[135]  Aaron F. Bobick,et al.  Recognition of Visual Activities and Interactions by Stochastic Parsing , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[136]  Anil K. Jain,et al.  Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[137]  Dima Damen,et al.  Recognizing linked events: Searching the space of feasible explanations , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[138]  Irfan A. Essa,et al.  Structure from Statistics - Unsupervised Activity Analysis using Suffix Trees , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[139]  Hari Sundaram,et al.  Networked multimedia event exploration , 2004, MULTIMEDIA '04.

[140]  Shih-Fu Chang,et al.  Discovering meaningful multimedia patterns with audio-visual concepts and associated text , 2004, 2004 International Conference on Image Processing, 2004. ICIP '04..

[141]  Regunathan Radhakrishnan,et al.  Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[142]  Kerry Rodden,et al.  How do people manage their digital photographs? , 2003, CHI '03.

[143]  Azriel Rosenfeld,et al.  Computer Vision , 1988, Adv. Comput..

[144]  Rong Yan,et al.  Semantic concept-based query expansion and re-ranking for multimedia retrieval , 2007, ACM Multimedia.

[145]  TomasiCarlo,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000 .

[146]  Takeo Kanade,et al.  Name-It: Naming and Detecting Faces in News Videos , 1999, IEEE Multim..

[147]  Yoram Singer,et al.  Unsupervised Models for Named Entity Classification , 1999, EMNLP.

[148]  J. Markel,et al.  The SIFT algorithm for fundamental frequency estimation , 1972 .

[149]  Samy Bengio,et al.  Semi-supervised adapted HMMs for unusual event detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[150]  Chen Wu,et al.  Layered and Collaborative Gesture Analysis in Multi-Camera Networks , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[151]  Yee Whye Teh,et al.  Names and faces in the news , 2004, CVPR 2004.

[152]  Mubarak Shah,et al.  Motion-based recognition a survey , 1995, Image Vis. Comput..

[153]  Daniel P. W. Ellis,et al.  Accessing Minimal-Impact Personal Audio Archives , 2006, IEEE MultiMedia.

[154]  Yee Whye Teh,et al.  Names and faces in the news , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[155]  Jiebo Luo,et al.  Large-scale multimodal semantic concept detection for consumer video , 2007, MIR '07.

[156]  Stefan Sharff The Elements of Cinema: Toward a Theory of Cinesthetic Impact , 1982 .

[157]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[158]  Simon King,et al.  From context to content: leveraging context to infer media metadata , 2004, MULTIMEDIA '04.

[159]  Lexing Xie,et al.  Modeling personal and social network context for event annotation in images , 2007, JCDL '07.

[160]  Alberto Del Bimbo,et al.  Taking into Consideration Sports Semantic Annotation of Sports Videos Content-based Multimedia Indexing and Retrieval , 2002 .

[161]  Yaser Sheikh,et al.  CASEE: A Hierarchical Event Representation for the Analysis of Videos , 2004, AAAI.

[162]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.