Multimodal Video Indexing: A Review of the State-of-the-Art

Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is infeasible for large video collections. In this paper we survey several methods that aim to automate this time- and resource-consuming process. Good reviews of single-modality video indexing have appeared in the literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in a collaborative fashion. Therefore, instead of treating the different information sources and their specific algorithms separately, we focus on the similarities and differences between the modalities. To that end we put forward a unifying, multimodal framework that views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types, for which automatic methods are found in the literature. It furthermore forms the basis for categorizing these different methods.
