Cross Modal Disambiguation

We consider strategies for reducing ambiguity in multi-modal data, particularly in the domain of images and text. Large data sets containing images with associated text (and vice versa) are readily available, and recent work has exploited such data to learn models for linking visual elements to semantics. This requires addressing a correspondence ambiguity because it is generally not known which parts of the images connect with which language elements. In this paper we first discuss using language processing to reduce correspondence ambiguity in loosely labeled image data. We then consider a similar problem of using visual correlates to reduce ambiguity in text with associated images. Only rudimentary image understanding is needed for this task because the image only needs to help differentiate between a limited set of choices, namely the senses of a particular word.

[1]  Thomas Hofmann,et al.  Multiple instance learning with generalized support vector machines , 2002, AAAI/IAAI.

[2]  S. Sclaroff,et al.  Combining textual and visual cues for content-based image retrieval on the World Wide Web , 1998, Proceedings. IEEE Workshop on Content-Based Access of Image and Video Libraries (Cat. No.98EX173).

[3]  Jitendra Malik,et al.  Shape matching and object recognition using low distortion correspondences , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[4]  Eric Brill,et al.  A Simple Rule-Based Part of Speech Tagger , 1992, HLT.

[5]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[6]  John Cocke,et al.  A Statistical Approach to Machine Translation , 1990, CL.

[7]  Julio Gonzalo,et al.  Indexing with WordNet synsets can improve text retrieval , 1998, WordNet@ACL/COLING.

[8]  Andrés Montoyo,et al.  Interface for WordNet Enrichment with Classification Systems , 2001, DEXA.

[9]  David A. Forsyth,et al.  The effects of segmentation and feature choice in a translation model of object recognition , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[10]  Antonio Torralba,et al.  Sharing features: efficient boosting procedures for multiclass object detection , 2004, CVPR 2004.

[11]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[12]  Nando de Freitas,et al.  A Statistical Model for General Contextual Object Recognition , 2004, ECCV.

[13]  Qi Zhang,et al.  EM-DD: An Improved Multiple-Instance Learning Technique , 2001, NIPS.

[14]  Rada Mihalcea,et al.  Word Sense Disambiguation based on Semantic Density , 1998, WordNet@ACL/COLING.

[15]  Anna Herwig,et al.  Lexicon and Grammar , 2005 .

[16]  Sally A. Goldman,et al.  Multiple-Instance Learning of Real-Valued Data , 2001, J. Mach. Learn. Res..

[17]  David A. Forsyth,et al.  Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[18]  W. N. Locke,et al.  Machine Translation of Languages , 1956 .

[19]  Oded Maron,et al.  Multiple-Instance Learning for Natural Scene Classification , 1998, ICML.

[20]  Rada Mihalcea,et al.  SenseLearner: Minimally supervised Word Sense Disambiguation for all words in open text , 2004, SENSEVAL@ACL.

[21]  Tomás Lozano-Pérez,et al.  A Framework for Multiple-Instance Learning , 1997, NIPS.

[22]  Yehoshua Bar-Hillel,et al.  The Present Status of Automatic Translation of Languages , 1960, Adv. Comput..

[23]  Mitchell Marcus,et al.  Empirical Methods for Exploiting Parallel Texts , 2001 .

[24]  W. Nelson Francis,et al.  FREQUENCY ANALYSIS OF ENGLISH USAGE: LEXICON AND GRAMMAR , 1983 .

[25]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[26]  Qi Zhang,et al.  Content-Based Image Retrieval Using Multiple-Instance Learning , 2002, ICML.

[27]  Rada Mihalcea,et al.  SenseLearner: Word Sense Disambiguation for All Words in Unrestricted Text , 2005, ACL.

[28]  Eneko Agirre,et al.  Word Sense Disambiguation using Conceptual Density , 1996, COLING.

[29]  Jitendra Malik,et al.  Blobworld: A System for Region-Based Image Indexing and Retrieval , 1999, VISUAL.

[30]  Anthony Hoogs,et al.  Evaluation of Localized Semantics: Data, Methodology, and Experiments , 2008, International Journal of Computer Vision.

[31]  Eneko Agirre,et al.  A Proposal for Word Sense Disambiguation using Conceptual Distance , 1995, ArXiv.

[32]  David A. Forsyth,et al.  Learning the semantics of words and pictures , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[33]  Rada Mihalcea,et al.  An Iterative Approach to Word Sense Disambiguation , 2000, FLAIRS.

[34]  R. Manmatha,et al.  Automatic image annotation and retrieval using cross-media relevance models , 2003, SIGIR.

[35]  David Yarowsky,et al.  One Sense Per Discourse , 1992, HLT.

[36]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[37]  Jitendra Malik,et al.  Normalized Cuts and Image Segmentation , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[38]  Kobus Barnard,et al.  Word Sense Disambiguation with Pictures , 2003, Artif. Intell..

[39]  Thomas Hofmann,et al.  Statistical Models for Co-occurrence Data , 1998 .

[40]  Robert Wilensky,et al.  Experiments in Improving Unsupervised Word Sense Disambiguation , 2003 .

[41]  R. Manmatha,et al.  Multiple Bernoulli relevance models for image and video annotation , 2004, CVPR 2004.

[42]  Mads Nielsen,et al.  Computer Vision — ECCV 2002 , 2002, Lecture Notes in Computer Science.

[43]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[44]  George A. Miller,et al.  A Semantic Concordance , 1993, HLT.

[45]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.

[46]  Eric Brill,et al.  Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging , 1995, CL.

[47]  David A. Forsyth,et al.  Matching Words and Pictures , 2003, J. Mach. Learn. Res..

[48]  Kobus Barnard,et al.  Evaluating image retrieval , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[49]  B. S. Manjunath,et al.  Unsupervised Segmentation of Color-Texture Regions in Images and Video , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[50]  Makoto Nagao,et al.  General Word Sense Disambiguation Method Based on a Full Sentential Context , 1998, WordNet@ACL/COLING.