Physical querying with multi-modal sensing

We present Marvin, a system that lets users search for physical objects using a mobile or wearable device. It integrates HOG-based object recognition, SURF-based localization, automatic speech recognition, and user feedback with a probabilistic model to recognize the "object of interest" with high accuracy and at interactive speeds. Once the object of interest is recognized, the information the user is querying, e.g., reviews or purchase options, is displayed on the user's mobile or wearable device. We tested this prototype in a real-world retail store during business hours, under varying degrees of background noise and clutter. We show that this multi-modal approach achieves superior recognition accuracy compared to a vision system alone, especially in cluttered scenes where a vision system could not determine which object is of interest to the user without additional input. The system scales computationally to large numbers of objects by focusing compute-intensive vision resources on the objects most likely to be of interest, as inferred from user speech and implicit localization. We present the system architecture, the probabilistic model that integrates the multi-modal information, and empirical results showing the benefits of multi-modal integration.
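The abstract does not give the exact form of the probabilistic model, but the described integration of vision, speech, localization, and feedback cues can be illustrated with a minimal naive-Bayes-style fusion sketch: each modality contributes a likelihood score per candidate object, and the posterior is the normalized product of the prior with those likelihoods. The object names, scores, and the `fuse_modalities` helper below are hypothetical illustrations, not the paper's actual model.

```python
def fuse_modalities(prior, likelihoods):
    """Combine a prior over candidate objects with per-modality likelihoods.

    prior: dict mapping object name -> prior probability.
    likelihoods: list of dicts, one per modality (e.g., vision, speech,
    localization), each mapping object name -> likelihood score.
    Returns a normalized posterior, assuming conditional independence
    of modalities given the object (a naive-Bayes simplification).
    """
    posterior = dict(prior)
    for lik in likelihoods:
        for obj in posterior:
            # A small floor avoids zeroing out objects a modality never scored.
            posterior[obj] *= lik.get(obj, 1e-9)
    total = sum(posterior.values())
    return {obj: p / total for obj, p in posterior.items()}


# Hypothetical example: vision alone is ambiguous between two boxes on a
# cluttered shelf, but the speech cue ("the cereal") disambiguates.
prior = {"cereal": 1 / 3, "soup": 1 / 3, "soda": 1 / 3}
vision = {"cereal": 0.5, "soup": 0.4, "soda": 0.1}   # ambiguous detector scores
speech = {"cereal": 0.8, "soup": 0.1, "soda": 0.1}   # ASR keyword match

posterior = fuse_modalities(prior, [vision, speech])
best = max(posterior, key=posterior.get)
```

This also hints at the scaling strategy described above: cheap cues (speech, localization) can prune the candidate set before expensive HOG-based detection is run on the remaining objects.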
