JIGSAW: interactive mobile visual search with multimodal queries

The traditional text-based visual search has not been sufficiently improved over the years to accommodate the new emerging demand of mobile users. While on the go, searching on one's phone is becoming pervasive. This paper presents an innovative application for mobile phone users to facilitate their visual search experience. By taking advantage of smart phone functionalities such as multi-modal and multi-touch interactions, users can more conveniently formulate their search intent, and thus search performance can be significantly improved. The system, called JIGSAW (Joint search with ImaGe, Speech, And Words), represents one of the first attempts to create an interactive and multi-modal mobile visual search application. The key of JIGSAW is the composition of an exemplary image query generated from the raw speech via multi-touch user interaction, as well as the visual search based on the exemplary image. Through JIGSAW, users can formulate their search intent in a natural way like playing a jigsaw puzzle on the phone screen: 1) a user speaks a natural sentence as the query, 2) the speech is recognized and transferred to text which is further decomposed to keywords through entity extraction, 3) the user selects preferred exemplary images that can visually represent his/her intent and composes a query image via multi-touch, and 4) the composite image is then used as a visual query to search similar images. We have deployed JIGSAW on a real-world phone system, evaluated the performance on one million images, and demonstrated that it is an effective complement to existing mobile visual search applications.

[1]  HongJiang Zhang,et al.  Contrast-based image attention analysis by using fuzzy growing , 2003, MULTIMEDIA '03.

[2]  Bernd Girod,et al.  CHoG: Compressed histogram of gradients A low bit-rate feature descriptor , 2009, CVPR.

[3]  Won-Keun Yang,et al.  Image Description and Matching Scheme for Identical Image Searching , 2009, 2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns.

[4]  Xiao Li,et al.  Understanding the Semantic Structure of Noun Phrase Queries , 2010, ACL.

[5]  Tao Mei,et al.  CrowdReranking: exploring multiple search engines for visual search reranking , 2009, SIGIR.

[6]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[7]  Xing Xie,et al.  Photo-to-Search: Using Camera Phones to Inquire of the Surrounding World , 2006, 7th International Conference on Mobile Data Management (MDM'06).

[8]  Hao Xu,et al.  Image search by concept map , 2010, SIGIR '10.

[9]  Bernd Girod,et al.  Low latency image retrieval with progressive transmission of CHoG descriptors , 2010, MCMC '10.

[10]  Satoshi Sekine,et al.  Extended Named Entity Hierarchy , 2002, LREC.

[11]  Bernd Girod,et al.  Outdoors augmented reality on mobile phone using loxel-based visual feature organization , 2008, MIR '08.

[12]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[14]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[15]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[16]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[17]  Liqing Zhang,et al.  MindFinder: interactive sketch-based image search on millions of images , 2010, ACM Multimedia.

[18]  G. Griffin,et al.  Caltech-256 Object Category Dataset , 2007 .

[19]  Xian-Sheng Hua,et al.  Interactive Image Search by Color Map , 2011, TIST.

[20]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[21]  Barry Smyth,et al.  Mobile information access: A study of emerging search behavior on the mobile Internet , 2007, TWEB.

[22]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[23]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.