Exploiting words and pictures

There are billions of images with associated text available on the web. Common settings where pictures and words are naturally linked include web pages, captioned photographs, and video with speech or closed captioning. The central problem that must be solved in order to organize these collections effectively is how to extract, from large pools of pictures with noisy text, the images that depict specified objects. This problem is challenging because the relationship between the words associated with an image and the objects depicted within it is often complex. This thesis demonstrates that, in many situations, collections of illustrated material can be exploited by using information from both the images themselves and the associated text.

The first project demonstrates that one can build a large collection of labeled face images by identifying faces in images, identifying names in captions, and then linking the faces and the names. The linking process uses the fact that images of the same person tend to look more similar, in appropriate features, than images of different people. Furthermore, the structure of the language in a caption often supplies important cues as to which of the named people actually appear in the image.

The second project shows that relations between words and images are strong even when the text has a less formal structure than captions do. Images retrieved from the internet are classified as containing one of a set of animals or not, using both text that appears near the image and a set of simple image appearance descriptors. Animals are notoriously difficult to identify because their appearance changes quite dramatically; however, this combination of words and weak appearance descriptors yields a fairly accurate classifier.

The third project deals with the tendency of users to attach labels to images where they do not belong, typically because labels are attached to a whole set of images rather than to each image individually. This means that, for example, many images labeled with "Chrysler building" do not in fact depict that building. However, the images that do depict it tend to look similar in an appropriate sense, and this cue makes it possible to find images that are iconic representations of such a category.
