Deep Learning for Content-Based Image Retrieval: A Comprehensive Study

Learning effective feature representations and similarity measures are crucial to the retrieval performance of a content-based image retrieval (CBIR) system. Despite extensive research efforts for decades, it remains one of the most challenging open problems that considerably hinders the successes of real-world CBIR systems. The key challenge has been attributed to the well-known ``semantic gap'' issue that exists between low-level image pixels captured by machines and high-level semantic concepts perceived by human. Among various techniques, machine learning has been actively investigated as a possible direction to bridge the semantic gap in the long term. Inspired by recent successes of deep learning techniques for computer vision and other applications, in this paper, we attempt to address an open problem: if deep learning is a hope for bridging the semantic gap in CBIR and how much improvements in CBIR tasks can be achieved by exploring the state-of-the-art deep learning techniques for learning feature representations and similarity measures. Specifically, we investigate a framework of deep learning with application to CBIR tasks with an extensive set of empirical studies by examining a state-of-the-art deep learning method (Convolutional Neural Networks) for CBIR tasks under varied settings. From our empirical studies, we find some encouraging results and summarize some important insights for future research.

[1]  Inderjit S. Dhillon,et al.  Online Metric Learning and Fast Similarity Search , 2008, NIPS.

[2]  David J. Fleet,et al.  Hamming Distance Metric Learning , 2012, NIPS.

[3]  Koby Crammer,et al.  Online Passive-Aggressive Algorithms , 2003, J. Mach. Learn. Res..

[4]  Wei Liu,et al.  Learning Distance Metrics with Contextual Constraints for Image Retrieval , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[5]  Yongdong Zhang,et al.  Scalable Similarity Search With Topology Preserving Hashing , 2014, IEEE Transactions on Image Processing.

[6]  James M. Rehg,et al.  CENTRIST: A Visual Descriptor for Scene Categorization , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Lei Wu,et al.  Enhancing Bag-of-Words Models with Semantics-Preserving Metric Learning , 2011, IEEE MultiMedia.

[8]  Li Deng,et al.  A tutorial survey of architectures, algorithms, and applications for deep learning , 2014, APSIPA Transactions on Signal and Information Processing.

[9]  Nenghai Yu,et al.  Semantics-Preserving Bag-of-Words Models and Applications , 2010, IEEE Transactions on Image Processing.

[10]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Anil K. Jain,et al.  Image retrieval using color and shape , 1996, Pattern Recognit..

[12]  Marcel Worring,et al.  Content-Based Image Retrieval at the End of the Early Years , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Ajmal S. Mian,et al.  Image Set Based Face Recognition Using Self-Regularized Non-Negative Coding and Adaptive Distance Metric Learning , 2013, IEEE Transactions on Image Processing.

[14]  Luca Maria Gambardella,et al.  Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images , 2012, NIPS.

[15]  Enrique G. Ortiz,et al.  Evaluating Open-Universe Face Identification on the Web , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[16]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[18]  Cordelia Schmid,et al.  Is that you? Metric learning approaches for face identification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[19]  Geoffrey Zweig,et al.  Linguistic Regularities in Continuous Space Word Representations , 2013, NAACL.

[20]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[21]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[22]  Shree K. Nayar,et al.  Attribute and simile classifiers for face verification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[23]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[24]  Liang-Tien Chia,et al.  Learning image-to-class distance metric for image classification , 2013, TIST.

[25]  Nicu Sebe,et al.  Content-based multimedia information retrieval: State of the art and challenges , 2006, TOMCCAP.

[26]  Yongdong Zhang,et al.  A Prior-Free Weighting Scheme for Binary Code Ranking , 2014, IEEE Transactions on Multimedia.

[27]  Jinyu Li,et al.  Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks. , 2013, ICLR 2013.

[28]  B. S. Manjunath,et al.  Texture Features for Browsing and Retrieval of Image Data , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[29]  Rong Jin,et al.  Rank-based distance metric learning: An application to image retrieval , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[31]  Chunyan Miao,et al.  Online multimodal deep similarity learning with application to image retrieval , 2013, ACM Multimedia.

[32]  Shumeet Baluja,et al.  VisualRank: Applying PageRank to Large-Scale Image Search , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Kilian Q. Weinberger,et al.  Distance Metric Learning for Large Margin Nearest Neighbor Classification , 2005, NIPS.

[34]  Samy Bengio,et al.  Large Scale Online Learning of Image Similarity Through Ranking , 2009, J. Mach. Learn. Res..

[35]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Dimitrios Gunopulos,et al.  Locally Adaptive Metric Nearest-Neighbor Classification , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[37]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[38]  Tomer Hertz,et al.  Learning Distance Functions using Equivalence Relations , 2003, ICML.

[39]  Geoffrey E. Hinton,et al.  Semantic hashing , 2009, Int. J. Approx. Reason..

[40]  Marwan Mattar,et al.  Labeled Faces in the Wild: A Database forStudying Face Recognition in Unconstrained Environments , 2008 .

[41]  Hong Chang,et al.  Kernel-based distance metric learning for content-based image retrieval , 2007, Image Vis. Comput..

[42]  Antonio Torralba,et al.  Scene-Centered Description from Spatial Envelope Properties , 2002, Biologically Motivated Computer Vision.

[43]  Dong Yu,et al.  Feature Learning in Deep Neural Networks - A Study on Speech Recognition Tasks , 2013, ICLR.

[44]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[45]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[46]  Geoffrey E. Hinton,et al.  Restricted Boltzmann machines for collaborative filtering , 2007, ICML '07.

[47]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[48]  Andrew Y. Ng,et al.  Improving Word Representations via Global Context and Multiple Word Prototypes , 2012, ACL.

[49]  Geoffrey E. Hinton,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[50]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[51]  Pascal Vincent,et al.  Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives , 2012, ArXiv.

[52]  Chong-Wah Ngo,et al.  Evaluating bag-of-visual-words representations in scene classification , 2007, MIR '07.

[53]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[54]  Chunyan Miao,et al.  Learning to name faces: a multimodal learning scheme for search-based face annotation , 2013, SIGIR.

[55]  Yongdong Zhang,et al.  Contextual Query Expansion for Image Retrieval , 2014, IEEE Transactions on Multimedia.

[56]  Rong Jin,et al.  Regularized Distance Metric Learning: Theory and Algorithm , 2009, NIPS.

[57]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[58]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.