Multimedia content analysis: using both audio and visual clues

Multimedia content analysis refers to the computerized understanding of the semantic meaning of a multimedia document, such as a video sequence with an accompanying audio track. The semantics of a multimedia document are embedded in multiple forms that are usually complementary to each other. Therefore, it is necessary to analyze all types of data: image frames, sound tracks, text that can be extracted from image frames, and spoken words that can be deciphered from the audio track. This usually involves segmenting the document into semantically meaningful units, classifying each unit into a predefined scene type, and indexing and summarizing the document for efficient retrieval and browsing. We review advances in using audio and visual information jointly to accomplish these tasks. We describe audio and visual features that can effectively characterize scene content, present selected algorithms for segmentation and classification, and review some testbed systems for video archiving and retrieval. We also describe audio and visual descriptors and description schemes that are being considered by the MPEG-7 standard for multimedia content description.
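
As a rough illustration of the segmentation step mentioned above, the following Python sketch declares a segment boundary only where the visual content and the audio level both change sharply between consecutive units. It is a minimal sketch under stated assumptions, not one of the reviewed algorithms: the function names, the L1 histogram distance, the relative energy-change measure, and both thresholds are hypothetical choices made for illustration, and the inputs are assumed to be equal-length lists with one normalized color histogram and one audio energy value per unit.

```python
from typing import List, Sequence


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between two normalized color histograms (illustrative choice)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def find_boundaries(
    frame_histograms: List[Sequence[float]],
    audio_energy: List[float],
    visual_threshold: float = 0.5,  # assumed value, not from the source
    audio_threshold: float = 0.5,   # assumed value, not from the source
) -> List[int]:
    """Return indices i where a boundary is declared between unit i-1 and unit i.

    A boundary requires both a large visual change and a large relative
    change in audio energy. Both input lists are assumed to have equal length.
    """
    boundaries = []
    for i in range(1, len(frame_histograms)):
        visual_change = histogram_distance(frame_histograms[i - 1], frame_histograms[i])
        prev, cur = audio_energy[i - 1], audio_energy[i]
        # Relative energy change in [0, 1]; the small constant avoids division by zero.
        audio_change = abs(cur - prev) / max(prev, cur, 1e-12)
        if visual_change > visual_threshold and audio_change > audio_threshold:
            boundaries.append(i)
    return boundaries


if __name__ == "__main__":
    histograms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    energy = [1.0, 1.0, 0.2, 0.2]
    print(find_boundaries(histograms, energy))
```

Requiring agreement between the two cues is only one possible way to combine them; a weighted sum of the two change measures, or a statistical model over both feature streams, would be alternative designs.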
