Searching Personal Photos on the Phone with Instant Visual Query Suggestion and Joint Text-Image Hashing

The ubiquity of mobile devices has led to unprecedented growth in personal photo collections on the phone. One significant pain point for today's mobile users is instantly finding the specific photos they want. Existing applications (e.g., Google Photos and OneDrive) have predominantly focused on cloud-based solutions, leaving client-side challenges (e.g., query formulation, photo tagging, and search) unsolved. This considerably hinders the user experience on the phone. In this paper, we present a personal photo search system on the phone that enables instant and accurate photo search through visual query suggestion and joint text-image hashing. Specifically, the system is characterized by three distinctive properties: 1) visual query suggestion (VQS) to facilitate the formulation of queries in a joint text-image form, 2) light-weight convolutional and sequential deep neural networks to extract representations for both photos and queries, and 3) joint text-image hashing (with compact binary codes) to facilitate binary image search and VQS. It is worth noting that all the components run on the phone, with client-side optimization by deep learning techniques. We collected 270 photo albums taken by 30 mobile users (corresponding to 37,000 personal photos) and conducted a series of field studies. We show that our system outperforms existing client-based solutions by 10× in search efficiency and achieves 92.3% precision in search accuracy, leading to a remarkably better user experience of photo discovery on the phone.
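To make the idea of searching with compact binary codes concrete, the following is a minimal sketch and not the system described above. It assumes that photos and queries have already been mapped into a shared real-valued embedding space; the hand-written vectors stand in for network outputs, and the names `binarize`, `hamming`, and `search` are hypothetical. Each embedding is thresholded by sign into a bit string, and photos are ranked by Hamming distance to the query code.

```python
from typing import List, Sequence, Tuple


def binarize(embedding: Sequence[float]) -> int:
    """Threshold an embedding by sign and pack the bits into one integer."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def search(query_code: int, photo_codes: Sequence[int],
           top_k: int = 3) -> List[Tuple[int, int]]:
    """Return (photo_index, distance) pairs for the top_k closest codes."""
    ranked = sorted((hamming(query_code, code), index)
                    for index, code in enumerate(photo_codes))
    return [(index, distance) for distance, index in ranked[:top_k]]


# Placeholder embeddings (assumed, 8 dimensions); a real system would obtain
# these from its image and text encoders.
photo_embeddings = [
    [1, 1, 1, 1, -1, -1, -1, -1],   # photo 0
    [1, 1, 1, -1, -1, -1, -1, -1],  # photo 1
    [-1, -1, -1, -1, 1, 1, 1, 1],   # photo 2
]
query_embedding = [1, 1, 1, 1, -1, -1, -1, 1]

photo_codes = [binarize(e) for e in photo_embeddings]
query_code = binarize(query_embedding)
results = search(query_code, photo_codes, top_k=2)
print(results)  # [(0, 1), (1, 2)]
```

Because the codes are plain integers, each comparison is an XOR followed by a bit count, which is one reason binary codes are attractive for on-device search.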
