Saliency-based selection of visual content for deep convolutional neural networks

The automatic description of digital multimedia content has mainly been developed for classification tasks, retrieval systems, and large-scale ordering of data. Preservation of cultural heritage is an important application field for these methods. We address a classification problem in cultural heritage: the classification of architectural styles in digital photographs of Mexican cultural heritage. In general, selecting the relevant content in a scene when training classification models makes the models more efficient in terms of accuracy and training time. Here we use a saliency-driven approach to predict visual attention in images and use it to train a deep convolutional neural network. We also present an analysis of the behavior of models trained with state-of-the-art image cropping and with saliency maps. Training models that are invariant to rotations requires data augmentation of the training set, which poses the problem of filling and normalizing the crops; we study different padding techniques and find an optimal solution. The results are compared with the state of the art in terms of accuracy and training time. Furthermore, we study saliency cropping in training and its generalization to another classical task, the weak labeling of massive collections of images containing objects of interest. These experiments are conducted on a large subset of the ImageNet database. This work extends preliminary research with respect to image padding methods and generalization to a large-scale generic database.
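As a rough illustration of the idea of saliency-driven cropping with padding, the sketch below crops a fixed-size window centred on the saliency-weighted centroid of an image and fills any part of the window that falls outside the image by padding. It is not the method of the paper: the function name `saliency_crop`, the use of the centroid as the crop centre, and the default edge-replication padding are all assumptions made for this example, and the saliency map is taken as given rather than predicted by any particular model.

```python
import numpy as np


def saliency_crop(image, saliency, size, pad_mode="edge"):
    """Return a size x size crop centred on the saliency-weighted centroid.

    Hypothetical helper for illustration only.
    image:    array of shape (H, W) or (H, W, C)
    saliency: non-negative array of shape (H, W)
    pad_mode: any mode accepted by numpy.pad, e.g. "edge", "reflect", "constant"
    """
    h, w = saliency.shape
    total = saliency.sum()
    if total > 0:
        # Saliency-weighted centroid (assumed choice of crop centre).
        ys, xs = np.mgrid[0:h, 0:w]
        cy = int(round((ys * saliency).sum() / total))
        cx = int(round((xs * saliency).sum() / total))
    else:
        # No salient pixels: fall back to the image centre.
        cy, cx = h // 2, w // 2

    half = size // 2
    # Pad the two spatial axes so that a window around any centre fits;
    # channel axes, if present, are left unpadded.
    pad_spec = ((half, size - half), (half, size - half))
    pad_spec += ((0, 0),) * (image.ndim - 2)
    padded = np.pad(image, pad_spec, mode=pad_mode)

    # In padded coordinates the centre sits at (cy + half, cx + half),
    # so the top-left corner of the window is simply (cy, cx).
    return padded[cy:cy + size, cx:cx + size]
```

Swapping `pad_mode` between values such as `"edge"`, `"reflect"`, and `"constant"` gives a simple way to compare how different fillings of out-of-image regions look in the resulting crops.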
