Discovering Mid-level Visual Connections in Space and Time

Finding recurring visual patterns in data underlies much of modern computer vision. The emerging subfield of visual category discovery/visual data mining proposes to cluster visual patterns that capture more complex appearance than low-level blobs, corners, or oriented bars, without requiring any semantic labels. In particular, mid-level visual elements have recently been proposed as a new type of visual primitive, and have been shown to be useful for various recognition tasks. The visual elements are discovered automatically from the data, and thus, have a flexible representation of being either a part, an object, a group of objects, etc. In this chapter, we explore what the mid-level visual representation brings to geo-spatial and longitudinal analyses. Specifically, we present a weakly supervised visual data mining approach that discovers connections between recurring mid-level visual elements in historic (temporal) and geographic (spatial) image collections, and attempts to capture the underlying visual style. In contrast to existing discovery methods that mine for patterns that remain visually consistent throughout the dataset, the goal is to discover visual elements whose appearance changes due to change in time or location, i.e., exhibit consistent stylistic variations across the label space (date or geo-location). To discover these elements, we first identify groups of patches that are style-sensitive. We then incrementally build correspondences to find the same element across the entire dataset. Finally, we train style-aware regressors that model each element’s range of stylistic differences. We apply our approach to date and geo-location prediction and show substantial improvement over several baselines that do not model visual style. We also demonstrate the method’s effectiveness on the related task of fine-grained classification.

[1]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[2]  Trevor Darrell,et al.  Unsupervised Learning of Categories from Sets of Partially Matching Image Features , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[3]  Arnold W. M. Smeulders,et al.  Fine-Grained Categorization by Alignments , 2013, 2013 IEEE International Conference on Computer Vision.

[4]  Larry S. Davis,et al.  Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance , 2011, 2011 International Conference on Computer Vision.

[5]  Eric P. Xing,et al.  Modeling and Analysis of Dynamic Behaviors of Web Image Collections , 2010, ECCV.

[6]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Ali Farhadi,et al.  Attribute Discovery via Predictable Discriminative Binary Codes , 2012, ECCV.

[8]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[9]  Yong Jae Lee,et al.  Object-graphs for context-aware category discovery , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  Trevor Darrell,et al.  Pose pooling kernels for sub-category recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Fei-Fei Li,et al.  Combining randomization and discrimination for fine-grained image categorization , 2011, CVPR 2011.

[12]  Joshua B. Tenenbaum,et al.  Separating Style and Content with Bilinear Models , 2000, Neural Computation.

[13]  Kun Duan,et al.  Discovering localized attributes for fine-grained recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Richard Szeliski,et al.  City-Scale Location Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Pietro Perona,et al.  Multiclass recognition and part localization with humans in the loop , 2011, 2011 International Conference on Computer Vision.

[16]  Alexei A. Efros,et al.  IM2GPS: estimating geographic information from a single image , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[18]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[19]  Linda G. Shapiro,et al.  Unsupervised Template Learning for Fine-Grained Object Recognition , 2012, NIPS.

[20]  Yong Jae Lee,et al.  Object-Graphs for Context-Aware Visual Category Discovery , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Sinisa Todorovic,et al.  From a Set of Shapes to Object Discovery , 2010, ECCV.

[22]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[23]  Abhinav Gupta,et al.  Constrained Semi-Supervised Learning Using Attributes and Comparative Attributes , 2012, ECCV.

[24]  Tomás Pajdla,et al.  Avoiding Confusing Features in Place Recognition , 2010, ECCV.

[25]  Michal Irani,et al.  “Clustering by Composition”—Unsupervised Discovery of Image Categories , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Kristen Grauman,et al.  Clues from the beaten path: Location estimation with bursty sequences of tourist photos , 2011, CVPR 2011.

[27]  Andrew Zisserman,et al.  Symbiotic Segmentation and Part Localization for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Yun Fu,et al.  Age Synthesis and Estimation via Faces: A Survey , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[30]  Alexei A. Efros,et al.  Image sequence geolocation with human travel priors , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[31]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[32]  Forrest N. Iandola,et al.  Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[34]  Peter N. Belhumeur,et al.  POOF: Part-Based One-vs.-One Features for Fine-Grained Categorization, Face Verification, and Attribute Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Yong Jae Lee,et al.  Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Yong Jae Lee,et al.  Foreground Focus: Unsupervised Learning from Partially Matching Images , 2009, International Journal of Computer Vision.

[37]  Alexei A. Efros,et al.  What makes Paris look like Paris? , 2015, Commun. ACM.

[38]  Alexei A. Efros,et al.  Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships , 2009, NIPS.

[39]  Alexei A. Efros,et al.  Discovering object categories in image collections , 2005 .

[40]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[41]  Alexei A. Efros,et al.  Dating Historical Color Images , 2012, ECCV.

[42]  Alessandro Perina,et al.  Geo-located image analysis using latent representations , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.