Matching Words and Pictures

We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several which explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to mixture of latent Dirichlet allocation (MoM-LDA). All models are assessed using a large collection of annotated images of real scenes. We study in depth the difficult problem of measuring performance. For the annotation task, we look at prediction performance on held out data. We present three alternative measures, oriented toward different types of task. Measuring the performance of correspondence methods is harder, because one must determine whether a word has been placed on the right region of an image. We can use annotation performance as a proxy measure, but accurate measurement requires hand labeled data, and thus must occur on a smaller scale. We show results using both an annotation proxy, and manually labeled data.

[1]  G. Webbe,et al.  Any Questions , 1946, The Indian medical gazette.

[2]  M. McCarthy The statistical approach , 1959 .

[3]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[4]  Rohini K. Srihari Extracting visual information from text: using captions to label faces in newspaper photographs , 1992 .

[5]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[6]  Bella Hass Weinberg,et al.  Challenges in indexing electronic text and images , 1994 .

[7]  Venu Govindaraju,et al.  Use of Collateral Text in Image Interpretation , 1994 .

[8]  Debra T. Burhans,et al.  Visual Semantics: Extracting Visual information from Text Accompanying Pictures , 1994, AAAI.

[9]  Gilles Celeux,et al.  On Stochastic Versions of the EM Algorithm , 1995 .

[10]  Peter G. B. Enser,et al.  Progress in Documentation Pictorial Information Retrieval , 1995, J. Documentation.

[11]  Michael J. Swain,et al.  WebSeer: An Image Search Engine for the World Wide Web , 1996 .

[12]  David A. Forsyth,et al.  Finding Naked People , 1996, ECCV.

[13]  Peter G. B. Enser,et al.  Analysis of user need in image archives , 1997, J. Inf. Sci..

[14]  Takeo Kanade,et al.  Name-It: association of face and name in video , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Tomaso A. Poggio,et al.  Pedestrian detection using wavelet templates , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[17]  Oded Maron,et al.  Learning from Ambiguity , 1998 .

[18]  Thomas Hofmann,et al.  Statistical Models for Co-occurrence Data , 1998 .

[19]  Oded Maron,et al.  Multiple-Instance Learning for Natural Scene Classification , 1998, ICML.

[20]  Thomas Hofmann,et al.  Learning and representing topic-a hierarchical mixture model for word occurences in document databas , 1998 .

[21]  S. Sclaroff,et al.  Combining textual and visual cues for content-based image retrieval on the World Wide Web , 1998, Proceedings. IEEE Workshop on Content-Based Access of Image and Video Libraries (Cat. No.98EX173).

[22]  Hinrich Schütze,et al.  Multimodal browsing of images in Web documents , 1999, Electronic Imaging.

[23]  Y. Mori,et al.  Image-to-word transformation based on dividing and vector quantizing images with words , 1999 .

[24]  David A. Forsyth,et al.  Computer Vision Tools for Finding Images and Video Sequences , 1999, Libr. Trends.

[25]  John C. Dalton,et al.  Hierarchical browsing and search of large image databases , 2000, IEEE Trans. Image Process..

[26]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[27]  David A. Forsyth,et al.  Learning the semantics of words and pictures , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[28]  David A. Forsyth,et al.  Clustering art , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[29]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[30]  David A. Forsyth,et al.  Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[31]  Jean Ponce,et al.  Computer Vision: A Modern Approach , 2002 .

[32]  Dan Tufis,et al.  Empirical Methods for Exploiting Parallel Texts , 2002, Lit. Linguistic Comput..

[33]  Jitendra Malik,et al.  Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[34]  T. Allen Thank you. , 2003, CJEM.

[35]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[36]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[37]  EstimationPeter,et al.  The Mathematics of Machine Translation : Parameter , 2004 .

[38]  Karen M. Drabenstott,et al.  Browse and Search Patterns in a Digital Image Database , 2004, Information Retrieval.

[39]  Eero Sormunen,et al.  End-User Searching Challenges Indexing Practices in the Digital Newspaper Photo Archive , 2004, Information Retrieval.

[40]  A. Volgenant,et al.  A shortest augmenting path algorithm for dense and sparse linear assignment problems , 1987, Computing.