Unsupervised Learning of Visual Sense Models for Polysemous Words

Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionary-based approach outperforms baseline methods.

[1]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[2]  M. F. Porter,et al.  An algorithm for suffix stripping , 1997 .

[3]  Mingjing Li,et al.  Web mining for Web image retrieval , 2001, J. Assoc. Inf. Sci. Technol..

[4]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[5]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[6]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[7]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Pietro Perona,et al.  Learning object categories from Google's image search , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[9]  David A. Forsyth,et al.  Animals on the Web , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[10]  Keiji Yanai,et al.  Cross Modal Disambiguation , 2006, Toward Category-Level Object Recognition.

[11]  Andrew McCallum,et al.  People-LDA: Anchoring Topics to People using Face Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[12]  Antonio Criminisi,et al.  Harvesting Image Databases from the Web , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[13]  Fei-Fei Li,et al.  OPTIMOL: Automatic Online Picture Collection via Incremental Model Learning , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.