Tagging Personal Photos with Transfer Deep Learning

The advent of mobile devices and media cloud services has led to the unprecedented growing of personal photo collections. One of the fundamental problems in managing the increasing number of photos is automatic image tagging. Existing research has predominantly focused on tagging general Web images with a well-labelled image database, e.g., ImageNet. However, they can only achieve limited success on personal photos due to the domain gaps between personal photos and Web images. These gaps originate from the differences in semantic distribution and visual appearance. To deal with these challenges, in this paper, we present a novel transfer deep learning approach to tag personal photos. Specifically, to solve the semantic distribution gap, we have designed an ontology consisting of a hierarchical vocabulary tailored for personal photos. This ontology is mined from $10,000$ active users in Flickr with 20 million photos and 2.7 million unique tags. To deal with the visual appearance gap, we discover the intermediate image representations and ontology priors by deep learning with bottom-up and top-down transfers across two domains, where Web images are the source domain and personal photos are the target. Moreover, we present two modes (single and batch-modes) in tagging and find that the batch-mode is highly effective to tag photo collections. We conducted personal photo tagging on 7,000 real personal photos and personal photo search on the MIT-Adobe FiveK photo dataset. The proposed tagging approach is able to achieve a performance gain of $12.8\%$ and $4.5\%$ in terms of NDCG@5, against the state-of-the-art hand-crafted feature-based and deep learning-based methods, respectively.

[1]  Weidong Yang,et al.  Labeling Images by Integrating Sparse Multiple Distance Learning and Semantic Context Modeling , 2012, ECCV.

[2]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[3]  Antonio Torralba,et al.  Building the gist of a scene: the role of global image features in recognition. , 2006, Progress in brain research.

[4]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[5]  Dong Liu,et al.  Tag ranking , 2009, WWW '09.

[6]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[7]  Yi Liu,et al.  Large-scale image annotation using visual synset , 2011, 2011 International Conference on Computer Vision.

[8]  Sylvain Paris,et al.  Learning photographic global tonal adjustment with a database of input / output image pairs , 2011, CVPR 2011.

[9]  Tao Mei,et al.  Contextual Bag-of-Words for Visual Categorization , 2011, IEEE Transactions on Circuits and Systems for Video Technology.

[10]  Nitish Srivastava,et al.  Discriminative Transfer Learning with Tree-based Priors , 2013, NIPS.

[11]  Yuan Shi,et al.  Geodesic flow kernel for unsupervised domain adaptation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Changsheng Li,et al.  Learning ordinal discriminative features for age estimation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Feiping Nie,et al.  Dyadic transfer learning for cross-domain image classification , 2011, 2011 International Conference on Computer Vision.

[14]  Wei-Ying Ma,et al.  AnnoSearch: Image Auto-Annotation by Search , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[15]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[16]  Marcel Worring,et al.  Learning tag relevance by neighbor voting for social image retrieval , 2008, MIR '08.

[17]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[18]  Carman Neustaedter,et al.  Image annotation using personal calendars as context , 2008, ACM Multimedia.

[19]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[21]  Yoshua Bengio,et al.  Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach , 2011, ICML.

[22]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[24]  Cordelia Schmid,et al.  TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[25]  Tao Mei,et al.  Image Tag Refinement With View-Dependent Concept Representations , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[26]  Jiebo Luo,et al.  Annotating collections of photos using hierarchical event and scene models , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Jürgen Schmidhuber,et al.  Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction , 2011, ICANN.

[28]  Ying He,et al.  Mining social images with distance metric learning for automated image tagging , 2011, WSDM '11.

[29]  Jiebo Luo,et al.  Event recognition from photo collections via PageRank , 2009, MM '09.

[30]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).