Combining Textual and Visual Information for Semantic Labeling of Images and Videos

Semantic labeling of large volumes of image and video archives is difficult, if not impossible, with traditional methods, because of the huge amount of human effort required for manual labeling in a supervised setting. Recently, semi-supervised techniques that make use of annotated image and video collections have been proposed as an alternative to reduce this effort. In this direction, different techniques, mostly adapted from the information retrieval literature, are applied to learn the unknown one-to-one associations between visual structures and semantic descriptions. Once these links are learned, the range of applications is wide, including better retrieval and automatic annotation of images and videos, labeling of image regions as a form of large-scale object recognition, and association of names with faces as a form of large-scale face recognition. In this chapter, after reviewing and discussing a variety of related studies, we present two methods in detail: the so-called “translation approach”, which translates visual structures into semantic descriptors using the ideas of statistical machine translation, and a graph-based approach, which finds the densest component of a graph corresponding to the largest group of similar visual structures associated with a semantic description.
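To make the two methods more concrete, the following Python sketch is an illustrative, simplified rendering rather than the chapter's actual implementation. It assumes each image has already been reduced to a set of discrete visual tokens ("blobs", e.g. cluster labels of segmented-region features) plus a set of annotation keywords, and that, for the graph-based method, a visual-similarity graph over the structures sharing one semantic description is given as an adjacency dictionary. The function and parameter names (train_translation_table, annotate, densest_component, num_iters, top_k) are hypothetical.

from collections import defaultdict


def train_translation_table(corpus, num_iters=10):
    # EM estimation of a translation table t(word | blob), in the spirit of
    # IBM Model 1.  `corpus` is a list of (blobs, words) pairs: the discrete
    # visual tokens and the annotation keywords of one image.
    blob_vocab = {b for blobs, _ in corpus for b in blobs}
    word_vocab = {w for _, words in corpus for w in words}

    # Uniform initialisation.
    t = {b: {w: 1.0 / len(word_vocab) for w in word_vocab} for b in blob_vocab}

    for _ in range(num_iters):
        counts = defaultdict(float)   # expected co-occurrence counts c(b, w)
        totals = defaultdict(float)   # expected counts c(b)
        # E-step: distribute each word's probability mass over the blobs of
        # the same image, proportionally to the current table.
        for blobs, words in corpus:
            for w in words:
                z = sum(t[b][w] for b in blobs)
                for b in blobs:
                    frac = t[b][w] / z
                    counts[(b, w)] += frac
                    totals[b] += frac
        # M-step: renormalise to obtain the new table.
        for b in blob_vocab:
            if totals[b] > 0:
                for w in word_vocab:
                    t[b][w] = counts.get((b, w), 0.0) / totals[b]
    return t


def annotate(blobs, t, top_k=3):
    # Translate the blobs of an unseen image into its most probable keywords.
    scores = defaultdict(float)
    for b in blobs:
        for w, p in t.get(b, {}).items():
            scores[w] += p
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def densest_component(adjacency):
    # Greedy peeling for a densest subgraph (Charikar-style): repeatedly
    # delete the minimum-degree node and remember the subgraph with the
    # highest density |E| / |V| seen along the way.  `adjacency` maps each
    # visual structure to the set of visually similar structures associated
    # with the same semantic description.
    adj = {u: set(vs) for u, vs in adjacency.items()}   # work on a copy
    best, best_density = set(adj), -1.0
    while adj:
        n_edges = sum(len(vs) for vs in adj.values()) / 2.0
        density = n_edges / len(adj)
        if density >= best_density:
            best, best_density = set(adj), density
        u = min(adj, key=lambda v: len(adj[v]))         # minimum-degree node
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
    return best

In the chapter's setting the blobs would come from clustering region descriptors of segmented images and the graph edges from thresholded visual similarity; here both are simply taken as given inputs.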
