Content-based video parsing and indexing based on audio-visual interaction

This paper presents a content-based video parsing and indexing method that analyzes both the auditory and the visual information source and accounts for their interrelations and synergy in order to extract high-level semantic information. The visual information is accessed on both a frame and an object basis. The aim of the method is to extract semantically meaningful video scenes and to assign one or more semantic labels to each of them. Because video is inherently temporal, time has to be accounted for, so the method generates time-constrained video representations and indices. The current approach searches for specific types of content information related to the presence or absence of speakers or persons. Parsing and indexing the audio source yields a mapping of speaker labels over time. Parsing and indexing the video source yields a mapping of talking-face shots over time. Integrating the audio and visual mappings under a set of interaction rules leads to higher levels of video abstraction and even to partial detection of the video's context.
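The integration step can be illustrated with a minimal sketch. The abstract does not state the actual interaction rules, so everything below is an assumption made for illustration: the `AudioSegment` and `VideoShot` types, the `integrate` function, the four scene labels, and the rule that a scene is any time interval where one audio segment and one video shot overlap. It is not the authors' method, only one plausible way to combine a speaker-label mapping and a talking-face mapping over time.

```python
from typing import List, NamedTuple, Optional, Tuple


class AudioSegment(NamedTuple):
    """Hypothetical entry of the speaker-label mapping (times in seconds)."""
    start: float
    end: float
    speaker: Optional[str]  # None means no speech was detected


class VideoShot(NamedTuple):
    """Hypothetical entry of the talking-face shot mapping (times in seconds)."""
    start: float
    end: float
    talking_face: bool


def label_for(speaker: Optional[str], talking_face: bool) -> str:
    """Assumed interaction rules; the paper's real rules are not given here."""
    if speaker is not None and talking_face:
        return f"{speaker} on screen"
    if speaker is not None:
        return f"{speaker} off screen"
    if talking_face:
        return "silent face"
    return "no person"


def integrate(audio: List[AudioSegment],
              video: List[VideoShot]) -> List[Tuple[float, float, str]]:
    """Intersect the two time mappings and label each overlapping interval."""
    scenes = []
    for a in audio:
        for v in video:
            start = max(a.start, v.start)
            end = min(a.end, v.end)
            if start < end:  # keep only intervals with non-zero overlap
                scenes.append((start, end, label_for(a.speaker, v.talking_face)))
    return sorted(scenes)


audio_map = [AudioSegment(0.0, 10.0, "speaker A"), AudioSegment(10.0, 15.0, None)]
video_map = [VideoShot(0.0, 6.0, True), VideoShot(6.0, 15.0, False)]
scenes = integrate(audio_map, video_map)
```

With the example mappings above, the sketch splits the 15-second clip into three labeled intervals: the speaker visible on screen, the same speaker heard off screen, and a final interval with no person detected. The nested loop is quadratic in the number of segments, which is acceptable for a sketch; a sweep over sorted boundaries would scale better.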
