Visual Turing test for computer vision systems

Significance

In computer vision, as in other fields of artificial intelligence, the methods of evaluation largely define the scientific effort. Most current evaluations measure detection accuracy, emphasizing the classification of regions according to objects from a predefined library. But detection is not the same as understanding. We present here a different evaluation system, in which a query engine prepares a written test ("visual Turing test") that uses binary questions to probe a system's ability to identify attributes and relationships in addition to recognizing objects.

Abstract

Today, computer vision systems are tested by their accuracy in detecting and localizing instances of objects. As an alternative, and motivated by the ability of humans to provide far richer descriptions and even tell a story about an image, we construct a "visual Turing test": an operator-assisted device that produces a stochastic sequence of binary questions from a given test image. The query engine proposes a question; the operator either provides the correct answer or rejects the question as ambiguous; the engine proposes the next question ("just-in-time truthing"). The test is then administered to the computer-vision system, one question at a time. After the system's answer is recorded, the system is provided the correct answer and the next question. Parsing is trivial and deterministic; the system being tested requires no natural language processing. The query engine employs statistical constraints, learned from a training set, to produce questions with essentially unpredictable answers: the answer to a question, given the history of questions and their correct answers, is nearly equally likely to be positive or negative. In this sense, the test is only about vision. The system is designed to produce streams of questions that follow natural story lines, from the instantiation of a unique object, through an exploration of its properties, and on to its relationships with other uniquely instantiated objects.
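The selection criterion described above — propose only questions whose answer, conditioned on the history of answers so far, is nearly equally likely to be positive or negative — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the question names, the toy training annotations, and the tolerance parameter are all hypothetical, and the conditional probability is estimated here by simply filtering training images consistent with the history.

```python
import random

# Hypothetical training annotations: each image is represented by a dict
# mapping a binary question id to its ground-truth answer. These names
# and data are illustrative only.
TRAIN = [
    {"person_present": True,  "person_left_half": True,  "two_people_interact": False},
    {"person_present": True,  "person_left_half": False, "two_people_interact": True},
    {"person_present": False, "person_left_half": False, "two_people_interact": False},
    {"person_present": True,  "person_left_half": True,  "two_people_interact": True},
]

def conditional_yes_rate(question, history):
    """Estimate P(answer = yes | history) from the training images that
    are consistent with every (question, answer) pair answered so far."""
    consistent = [img for img in TRAIN
                  if all(img.get(q) == a for q, a in history)]
    if not consistent:
        return None  # no evidence left; predictability cannot be assessed
    yes = sum(1 for img in consistent if img.get(question))
    return yes / len(consistent)

def next_question(candidates, history, tol=0.25):
    """Propose an unasked question whose estimated yes-rate given the
    history lies within tol of 1/2, i.e. is nearly unpredictable."""
    asked = {q for q, _ in history}
    unpredictable = []
    for q in candidates:
        if q in asked:
            continue
        p = conditional_yes_rate(q, history)
        if p is not None and abs(p - 0.5) <= tol:
            unpredictable.append(q)
    return random.choice(unpredictable) if unpredictable else None

# One step of a question stream: a person has been instantiated,
# and the engine looks for the next near-unpredictable question.
history = [("person_present", True)]
q = next_question(["person_present", "person_left_half",
                   "two_people_interact"], history)
```

In an operator-assisted run, the proposed question `q` would be shown to the operator, who either supplies the correct answer (appending it to `history`) or rejects the question as ambiguous for this image; either way the engine then proposes the next question.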
