Extracting semantics from audio-visual content: the final frontier in multimedia retrieval

Multimedia understanding is a fast-emerging interdisciplinary research area with tremendous potential for making effective use of multimedia content through intelligent analysis, and diverse application areas increasingly rely on multimedia understanding systems. Advances in the field are tied directly to advances in signal processing, computer vision, pattern recognition, multimedia databases, and smart sensors. We review state-of-the-art techniques in multimedia retrieval. In particular, we discuss how multimedia retrieval can be viewed as a pattern recognition problem and how the field increasingly relies on powerful pattern recognition and machine learning techniques. We also review state-of-the-art multimedia understanding systems, with particular emphasis on a system for semantic video indexing centered around multijects and multinets. Finally, we discuss how semantic retrieval is centered around concepts and context, and we survey the various mechanisms for modeling them.
