Matching Words and Pictures

We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several which explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to latent Dirichlet allocation (LDA). All models are assessed using a large collection of annotated images of real scenes. We study in depth the difficult problem of measuring performance. For the annotation task, we look at prediction performance on held-out data. We present three alternative measures, oriented toward different types of task. Measuring the performance of correspondence methods is harder, because one must determine whether a word has been placed on the right region of an image. We can use annotation performance as a proxy measure, but accurate measurement requires hand-labeled data, and thus must occur on a smaller scale. We show results using both the annotation proxy and manually labeled data.
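The translation model mentioned above treats region naming like bilingual lexicon induction: image regions are vector-quantized into discrete "blob" tokens, and an IBM Model 1-style EM procedure learns p(word | blob) from images paired with captions. The following is a minimal sketch of that EM loop; the blob tokens and toy corpus are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict

# Toy corpus: each image is (blob tokens, caption words).
# Blob tokens stand in for vector-quantized region features.
corpus = [
    (["b1", "b2"], ["sky", "grass"]),
    (["b1", "b3"], ["sky", "plane"]),
    (["b2", "b4"], ["grass", "tiger"]),
]

blobs = {b for bs, _ in corpus for b in bs}
words = {w for _, ws in corpus for w in ws}

# t[b][w] approximates p(word w | blob b); initialized uniformly.
t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}

for _ in range(30):  # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    # E-step: soft-align each caption word to the blobs in its image.
    for bs, ws in corpus:
        for w in ws:
            z = sum(t[b][w] for b in bs)  # normalizer over this image's blobs
            for b in bs:
                c = t[b][w] / z           # expected alignment count
                count[b][w] += c
                total[b] += c
    # M-step: re-estimate p(word | blob) from the expected counts.
    for b in blobs:
        for w in words:
            t[b][w] = count[b][w] / total[b]

# Region naming: pick the most probable word for each blob token.
naming = {b: max(t[b], key=t[b].get) for b in blobs}
print(naming)
```

Because Model 1's likelihood is convex, the shared blob "b1" is pulled toward the word that co-occurs with it in both images ("sky"), freeing "b3" to specialize to "plane"; the same disambiguation-by-co-occurrence drives region naming at scale.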

[1] Rohini K. Srihari. Extracting visual information from text: using captions to label faces in newspaper photographs, 1992.

[2] Bella Hass Weinberg, et al. Challenges in indexing electronic text and images, 1994.

[3] Venu Govindaraju, et al. Use of Collateral Text in Image Interpretation, 1994.

[4] Debra T. Burhans, et al. Visual Semantics: Extracting Visual Information from Text Accompanying Pictures, 1994, AAAI.

[5] Gilles Celeux, et al. On Stochastic Versions of the EM Algorithm, 1995.

[6] Peter G. B. Enser, et al. Progress in Documentation: Pictorial Information Retrieval, 1995, J. Documentation.

[7] Michael J. Swain, et al. WebSeer: An Image Search Engine for the World Wide Web, 1996.

[8] David A. Forsyth, et al. Finding Naked People, 1996, ECCV.

[9] Peter G. B. Enser, et al. Analysis of user need in image archives, 1997, J. Inf. Sci..

[10] Jitendra Malik, et al. Normalized cuts and image segmentation, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11] Oded Maron, et al. Learning from Ambiguity, 1998.

[12] Thomas Hofmann, et al. Statistical Models for Co-occurrence Data, 1998.

[13] Oded Maron, et al. Multiple-Instance Learning for Natural Scene Classification, 1998, ICML.

[14] S. Sclaroff, et al. Combining textual and visual cues for content-based image retrieval on the World Wide Web, 1998, Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries.

[15] Y. Mori, et al. Image-to-word transformation based on dividing and vector quantizing images with words, 1999.

[16] David A. Forsyth, et al. Computer Vision Tools for Finding Images and Video Sequences, 1999, Libr. Trends.

[17] Francine Chen, et al. Multi-Modal Browsing of Images in Web Documents, 1999.

[18] John C. Dalton, et al. Hierarchical browsing and search of large image databases, 2000, IEEE Trans. Image Process..

[19] James H. Martin, et al. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2000.

[20] David A. Forsyth, et al. Learning the semantics of words and pictures, 2001, Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001).

[21] Hinrich Schütze, et al. Foundations of Statistical Natural Language Processing, 1999, CL.

[22] David A. Forsyth, et al. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary, 2002, ECCV.

[23] Dan Tufis, et al. Empirical Methods for Exploiting Parallel Texts, 2002, Lit. Linguistic Comput..

[24] Jitendra Malik, et al. Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying, 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[25] Peter F. Brown, et al. The Mathematics of Machine Translation: Parameter Estimation, 2004.

[26] Karen M. Drabenstott, et al. Browse and Search Patterns in a Digital Image Database, 2004, Information Retrieval.

[27] Eero Sormunen, et al. End-User Searching Challenges Indexing Practices in the Digital Newspaper Photo Archive, 2004, Information Retrieval.

[28] A. Volgenant, et al. A shortest augmenting path algorithm for dense and sparse linear assignment problems, 1987, Computing.