Recognition as Translating Images into Text

We present an overview of a new paradigm for tackling long-standing computer vision problems. Specifically, our approach is to build statistical models that translate from visual representations (images) to semantic ones (associated text). Since providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web. In this paper we discuss how the translation approach offers a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize, and which are hard? Which objects are indistinguishable using our features? How can low-level vision processes, such as feature-based segmentation, be integrated with high-level processes such as grouping? We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.
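The translation view can be made concrete with a small sketch. The following is illustrative code of my own (not the authors' implementation): an IBM Model 1-style EM procedure that learns a translation table p(word | region token), where image regions are assumed to have been pre-quantized into discrete "blob" tokens. All token names and the toy data are invented for the example.

```python
from collections import defaultdict

def train_translation_table(pairs, n_iters=20):
    """EM for p(word | blob). pairs: list of (blob_tokens, words) per image."""
    blobs = {b for bs, _ in pairs for b in bs}
    words = {w for _, ws in pairs for w in ws}
    # Uniform initialization of the translation probabilities.
    t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}
    for _ in range(n_iters):
        counts = defaultdict(lambda: defaultdict(float))
        totals = defaultdict(float)
        # E-step: distribute each word's "credit" over the blobs in its image.
        for bs, ws in pairs:
            for w in ws:
                z = sum(t[b][w] for b in bs)  # normalizer over candidate blobs
                for b in bs:
                    c = t[b][w] / z
                    counts[b][w] += c
                    totals[b] += c
        # M-step: renormalize expected counts into probabilities.
        for b in blobs:
            for w in words:
                if totals[b]:
                    t[b][w] = counts[b][w] / totals[b]
    return t

# Toy data: blob tokens co-occurring with image keywords.
pairs = [
    (["blob_sky", "blob_grass"], ["sky", "grass"]),
    (["blob_sky", "blob_tiger"], ["sky", "tiger"]),
    (["blob_grass", "blob_tiger"], ["grass", "tiger"]),
]
t = train_translation_table(pairs)
best_for_sky = max(t["blob_sky"], key=t["blob_sky"].get)
```

On this toy data the co-occurrence pattern disambiguates the alignment, so EM concentrates each blob's distribution on its matching word (e.g. `blob_sky` on "sky"), even though every individual image leaves the correspondence ambiguous.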
