The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization

We present a visual recognition system for fine-grained visual categorization. The system is composed of a human and a machine working together and combines the complementary strengths of computer vision algorithms and (non-expert) human users. The human users provide two heterogeneous forms of information object part clicks and answers to multiple choice questions. The machine intelligently selects the most informative question to pose to the user in order to identify the object class as quickly as possible. By leveraging computer vision and analyzing the user responses, the overall amount of human effort required, measured in seconds, is minimized. Our formalism shows how to incorporate many different types of computer vision algorithms into a human-in-the-loop framework, including standard multiclass methods, part-based methods, and localized multiclass and attribute methods. We explore our ideas by building a field guide for bird identification. The experimental results demonstrate the strength of combining ignorant humans with poor-sighted machines the hybrid system achieves quick and accurate bird identification on a dataset containing 200 bird species.

[1]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[2]  Marin Ferecatu,et al.  A Statistical Framework for Image Category Search from a Mental Picture , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Cordelia Schmid,et al.  Combining attributes and Fisher vectors for efficient image retrieval , 2011, CVPR 2011.

[4]  Alfred O. Hero,et al.  A collaborative 20 questions model for target search with human-machine interaction , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Qiang Yang,et al.  A unified framework for semantics and feature based relevance feedback in image retrieval systems , 2000, ACM Multimedia.

[6]  Kristen Grauman,et al.  Large-scale live active learning: Training object detectors with crawled data and crowds , 2011, CVPR.

[7]  Pietro Perona,et al.  Multiclass recognition and part localization with humans in the loop , 2011, 2011 International Conference on Computer Vision.

[8]  Trevor Darrell,et al.  Pose pooling kernels for sub-category recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Deva Ramanan,et al.  Video Annotation and Tracking with Active Learning , 2011, NIPS.

[10]  James J. Little,et al.  Fine-Grained Categorization for 3D Scene Understanding , 2012, BMVC.

[11]  Devi Parikh,et al.  Attributes for Classifier Feedback , 2012, ECCV.

[12]  Thomas S. Huang,et al.  Relevance feedback in image retrieval: A comprehensive review , 2003, Multimedia Systems.

[13]  Yuchun Fang,et al.  Experiments in Mental Face Retrieval , 2005, AVBPA.

[14]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  C. Mervis,et al.  Order of acquisition of subordinate-, basic-, and superordinate-level categories. , 1982 .

[16]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[17]  Andrew Zisserman,et al.  Symbiotic Segmentation and Part Localization for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  Thomas G. Dietterich,et al.  Haar Random Forest Features and SVM Spatial Matching Kernel for Stonefly Species Identification , 2010, 2010 20th International Conference on Pattern Recognition.

[19]  Nuno Vasconcelos,et al.  Bridging the Gap: Query by Semantic Example , 2007, IEEE Transactions on Multimedia.

[20]  Forrest N. Iandola,et al.  Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[21]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[22]  W. John Kress,et al.  Leafsnap: A Computer Vision System for Automatic Plant Species Identification , 2012, ECCV.

[23]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[24]  Kristen Grauman,et al.  Interactively building a discriminative vocabulary of nameable attributes , 2011, CVPR 2011.

[25]  Vaishali D. Dhale Relevance Feedback for Image Retrieval , 2013 .

[26]  Daniel P. Huttenlocher,et al.  Efficient matching of pictorial structures , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[27]  I. Biederman,et al.  Subordinate-level object classification reexamined , 1999, Psychological research.

[28]  Peter N. Belhumeur,et al.  POOF: Part-Based One-vs.-One Features for Fine-Grained Categorization, Face Verification, and Attribute Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Arnold W. M. Smeulders,et al.  Fine-Grained Categorization by Alignments , 2013, 2013 IEEE International Conference on Computer Vision.

[31]  Thomas G. Dietterich,et al.  Dictionary-free categorization of very similar objects via stacked evidence trees , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Gary R. Bradski,et al.  A codebook-free and annotation-free approach for fine-grained image categorization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Wayne D. Gray,et al.  Basic objects in natural categories , 1976, Cognitive Psychology.

[34]  Deva Ramanan,et al.  Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces , 2010, ECCV.

[35]  Shree K. Nayar,et al.  FaceTracer: A Search Engine for Large Collections of Images with Faces , 2008, ECCV.

[36]  Eleanor Rosch,et al.  Principles of Categorization , 1978 .

[37]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Mark Everingham,et al.  Shared parts for deformable part-based models , 2011, CVPR 2011.

[39]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[40]  Jeff Donahue,et al.  Annotator rationales for visual recognition , 2011, 2011 International Conference on Computer Vision.

[41]  Peter I. Frazier,et al.  Twenty Questions with Noise: Bayes Optimal Policies for Entropy Loss , 2012, Journal of Applied Probability.

[42]  C. V. Jawahar,et al.  The truth about cats and dogs , 2011, 2011 International Conference on Computer Vision.

[43]  C. Lawrence Zitnick,et al.  Finding the weakest link in person detectors , 2011, CVPR 2011.

[44]  Kun Duan,et al.  Discovering localized attributes for fine-grained recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Gang Wang,et al.  Joint learning of visual attributes, object classes and visual saliency , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[46]  David W. Jacobs,et al.  Dog Breed Classification Using Part Localization , 2012, ECCV.

[47]  Subhransu Maji Discovering a Lexicon of Parts and Attributes , 2012, ECCV Workshops.

[48]  Ingemar J. Cox,et al.  The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments , 2000, IEEE Trans. Image Process..

[49]  Russell H. Taylor,et al.  Unified Detection and Tracking in Retinal Microsurgery , 2011, MICCAI.

[50]  Pietro Perona,et al.  Strong supervision from weak annotation: Interactive training of deformable part models , 2011, 2011 International Conference on Computer Vision.

[51]  Donald Geman,et al.  Shape Recognition and Twenty Questions , 2007 .

[52]  Shree K. Nayar,et al.  Attribute and simile classifiers for face verification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[53]  Dani Lischinski,et al.  A Closed-Form Solution to Natural Image Matting , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  VijayanarasimhanSudheendra,et al.  Large-Scale Live Active Learning , 2014 .

[55]  C. V. Jawahar,et al.  Cats and dogs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Thomas S. Huang,et al.  Image processing , 1971 .

[57]  Katja Markert,et al.  Learning Models for Object Recognition from Natural Language Descriptions , 2009, BMVC.

[58]  Donald Geman,et al.  An Active Testing Model for Tracking Roads in Satellite Images , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[59]  Pietro Perona,et al.  Visual Recognition with Humans in the Loop , 2010, ECCV.

[60]  Andrew Zisserman,et al.  A Visual Vocabulary for Flower Classification , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[61]  Subhransu Maji,et al.  Part Annotations via Pairwise Correspondence , 2012, HCOMP@AAAI.

[62]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[63]  Wen Wu,et al.  SmartLabel: an object labeling tool using iterated harmonic energy minimization , 2006, MM '06.

[64]  Andrew Zisserman,et al.  BiCoS: A Bi-level co-segmentation method for image classification , 2011, 2011 International Conference on Computer Vision.

[65]  Kristen Grauman,et al.  Implied Feedback: Learning Nuances of User Behavior in Image Search , 2013, 2013 IEEE International Conference on Computer Vision.

[66]  Cordelia Schmid,et al.  A maximum entropy framework for part-based texture and object recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[67]  Raphael Sznitman,et al.  Active Testing for Face Detection and Localization , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68]  Mark Craven,et al.  Curious machines: active learning with structured instances , 2008 .

[69]  Kristen Grauman,et al.  What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[70]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[71]  Devi Parikh Human-Debugging of Machines , 2011 .