Combining deep learning and unsupervised clustering to improve scene recognition performance

Deep Neural Networks (DNN) are now the state-of-the-art for many image and object recognition tasks, as illustrated by their performance on standard benchmarks. The success of DNNs is attributed to their ability to learn rich mid-level image representations, as opposed to hand-designed low-level features used in other image analysis methods. Typically a large dataset of unlabeled images is used for unsupervised feature learning, and then standard classifiers are trained on the features extracted from the images in a labeled set. In this paper, we show that clustering the images using the features from the DNN allows more accurate per-cluster classifiers to be learned, which improves the overall classification accuracy. We demonstrate the effectiveness of our approach on a scene recognition task.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[6]  Cecilia R. Aragon,et al.  A fast contour descriptor algorithm for supernova image classification , 2007, Electronic Imaging.

[7]  Ning Zhou,et al.  A Hybrid Probabilistic Model for Unified Collaborative and Content-Based Image Tagging , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Krista A. Ehinger,et al.  SUN Database: Exploring a Large Collection of Scene Categories , 2014, International Journal of Computer Vision.

[9]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[10]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[11]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[12]  John Langford Vowpal Wabbit , 2014 .

[13]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[14]  D. Sculley,et al.  Web-scale k-means clustering , 2010, WWW '10.

[15]  David G. Stork,et al.  Pattern Classification , 1973 .

[16]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[18]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.