The effects of segmentation and feature choice in a translation model of object recognition

We work with a model of object recognition in which words must be placed on image regions. This approach makes large-scale experiments relatively easy, so we can evaluate the effects of various early- and mid-level vision algorithms on recognition performance. We evaluate image segmentation algorithms by measuring word prediction accuracy for images segmented in different ways and represented by different feature sets. We take the view that good segmentations respect object boundaries, so word prediction should improve with better segmentation. In practice, however, it is usually very difficult to obtain segmentations that do not break up objects, so most practitioners attempt to merge segments to obtain better putative object representations. We demonstrate that our word-prediction paradigm readily allows us to predict potentially useful segment merges, even for segments that do not look similar (for example, merging the black and white halves of a penguin is not possible with feature-based segmentation; the main cue must be "familiar configuration"). These studies focus on unsupervised learning of recognition. However, we show that word prediction can be markedly improved by providing supervised information for a relatively small number of regions together with large quantities of unsupervised information. This supervisory information allows a better and more discriminative choice of features and breaks possible symmetries.
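To make the word-prediction paradigm concrete, the sketch below shows a minimal translation-style lexicon between vector-quantized region tokens ("blobs") and caption words, trained with EM and then used to predict words for a segmented image. It is an illustration under stated assumptions rather than the paper's exact model: the function names, the simplified IBM-Model-1-style E-step, and the mean-pooling prediction rule are choices made here for brevity.

```python
import numpy as np

def train_translation_table(images, n_blobs, n_words, n_iter=20):
    """EM for a table t[b, w] ~ p(word w | blob token b), in the spirit of
    IBM Model 1 applied to region/word co-occurrence.
    `images` is a list of (blob_tokens, word_tokens) pairs of integer ids."""
    t = np.full((n_blobs, n_words), 1.0 / n_words)  # uniform initialization
    for _ in range(n_iter):
        counts = np.zeros_like(t)
        for blobs, words in images:
            blobs = np.asarray(blobs)
            for w in words:
                # E-step: softly align word w to each region of this image
                p = t[blobs, w]
                total = p.sum()
                p = p / total if total > 0 else np.full(len(blobs), 1.0 / len(blobs))
                np.add.at(counts, (blobs, w), p)  # accumulate expected counts
        # M-step: renormalize each blob row to get p(word | blob)
        row_sums = counts.sum(axis=1, keepdims=True)
        t = np.where(row_sums > 0, counts / np.maximum(row_sums, 1e-12), 1.0 / n_words)
    return t

def predict_words(blob_tokens, t, top_k=3):
    """Score words for a new image by pooling p(word | blob) over its regions."""
    scores = t[np.asarray(blob_tokens)].mean(axis=0)
    return np.argsort(scores)[::-1][:top_k]
```

In this setting, word-prediction accuracy on held-out captions is the evaluation criterion, so a segmentation or feature choice that better respects object boundaries should yield a higher score.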
