Increasing the bandwidth of crowdsourced visual question answering to better support blind users

Many of the visual questions that blind people ask cannot be easily answered with a single image or a short response, especially when the questions are exploratory in nature, e.g., "What is in this area?" or "What tools are available on this workbench?" We introduce RegionSpeak, which allows blind users to capture large areas of visual information, identify all of the objects within them, and explore their spatial layout with fewer interactions. RegionSpeak helps blind users capture all of the relevant visual information using an interface designed to support stitching multiple images together. A parallel crowdsourcing workflow asks workers to define and describe regions of interest, allowing even complex images to be described quickly. The resulting regions and descriptions are displayed on an auditory touchscreen interface, allowing users to learn both what is in a scene and how it is laid out.
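To make the layout-exploration idea concrete, the sketch below (not from the paper; all names and the normalized-coordinate representation are assumptions) shows one plausible way the crowd-defined regions and their descriptions could be represented and hit-tested when a user touches the screen, with the matched description handed off to text-to-speech.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Region:
    """A crowd-labeled region of interest within a stitched image.

    Coordinates are normalized to [0, 1] so they stay valid
    regardless of the stitched image's pixel dimensions.
    """
    x: float          # left edge
    y: float          # top edge
    width: float
    height: float
    description: str  # free-text label written by a crowd worker

def region_at(regions: List[Region], touch_x: float, touch_y: float) -> Optional[Region]:
    """Return the region under a touch point, if any.

    Smaller regions win overlaps, so touching a tool on a workbench
    reads the tool's label rather than the workbench's.
    """
    hits = [r for r in regions
            if r.x <= touch_x <= r.x + r.width
            and r.y <= touch_y <= r.y + r.height]
    return min(hits, key=lambda r: r.width * r.height) if hits else None

# Hypothetical example: exploring a workbench scene by touch.
scene = [
    Region(0.0, 0.3, 1.0, 0.7, "wooden workbench"),
    Region(0.1, 0.4, 0.2, 0.2, "claw hammer"),
    Region(0.5, 0.5, 0.15, 0.1, "tape measure"),
]
hit = region_at(scene, 0.15, 0.45)
if hit:
    print(hit.description)  # would be spoken aloud: "claw hammer"
```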
