Fast Object Class Labelling via Speech

Object class labelling is the task of annotating images with labels on the presence or absence of objects from a given class vocabulary. Simply asking one yes-no question per class, however, has a cost that is linear in the vocabulary size and is thus inefficient for large vocabularies. Modern approaches rely on a hierarchical organization of the vocabulary to reduce annotation time, but remain expensive (several minutes per image for the 200 classes in ILSVRC). Instead, we propose a new interface where classes are annotated via speech. Speaking is fast and allows for direct access to the class name, without searching through a list or hierarchy. As additional advantages, annotators can simultaneously speak and scan the image for objects, the interface can be kept extremely simple, and using it requires less mouse movement. As annotators using our interface should only say words from a given class vocabulary, we propose a dedicated task to train them to do so. Through experiments on COCO and ILSVRC, we show that our method yields high-quality annotations at 2.3× to 14.9× less annotation time than existing methods.
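To make the core mechanism concrete, the sketch below shows one plausible way to turn a speech transcript into labels restricted to a fixed class vocabulary. It assumes an external speech recognizer has already produced the transcript string; the example vocabulary, the fuzzy matching via Python's `difflib`, and the function name `transcript_to_labels` are illustrative assumptions, not the interface described in the paper.

```python
# Minimal sketch (not the authors' implementation): map a free-form speech
# transcript to labels drawn from a fixed class vocabulary. The ASR step is
# assumed to have happened upstream; fuzzy matching here is only one possible
# way to absorb small recognition errors.
from difflib import get_close_matches

CLASS_VOCABULARY = ["person", "bicycle", "car", "dog", "chair"]  # e.g. a COCO subset

def transcript_to_labels(transcript: str, cutoff: float = 0.8) -> set[str]:
    """Return the set of vocabulary classes mentioned in a spoken transcript."""
    labels = set()
    for token in transcript.lower().split():
        # Fuzzy matching maps slightly misrecognized words to the nearest
        # class name ("dogg" -> "dog"); out-of-vocabulary tokens are ignored.
        match = get_close_matches(token, CLASS_VOCABULARY, n=1, cutoff=cutoff)
        if match:
            labels.add(match[0])
    return labels

print(transcript_to_labels("person dogg chair"))  # {'person', 'dog', 'chair'}
```

In a real annotation interface this mapping would run continuously while the annotator scans the image, so that speaking a class name immediately toggles its label; the sketch only illustrates the vocabulary-restriction idea.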
