Interactive Multimodal Visual Search on Mobile Device

This paper describes a novel multimodal interactive image search system for mobile devices. The system, Joint search with ImaGe, Speech, And Word Plus (JIGSAW+), takes full advantage of the multimodal input and natural user interactions of mobile devices. It is designed for users who already have a picture in mind but lack a precise description or name for it. By describing the picture in speech and then refining the recognized query by interactively composing a visual query from exemplary images, the user can find the desired images through a few natural multimodal interactions with their mobile device. Compared with our previous work, JIGSAW, the algorithm has been significantly improved in three aspects: 1) a segmentation-based image representation is adopted to remove the artificial block partitions; 2) relative position checking replaces the fixed position penalty; and 3) an inverted index is constructed instead of brute-force matching. JIGSAW+ achieves a 5% gain in search performance and is ten times faster.
