Scene Classification with Deep Convolutional Neural Networks

The use of massive datasets like ImageNet and the revival of Convolutional Neural Networks (CNNs) for learning deep features have significantly improved the performance of object recognition. However, scene classification has not achieved the same level of success, since there is still a semantic gap between the deep features and the high-level context. In this project we propose a novel scene classification method which combines a CNN with a spatial pyramid to generate high-level context-aware features for one-vs-all linear SVMs. Our method achieves a higher average accuracy (68.295%) than other state-of-the-art results on the MIT Indoor67 dataset, using only deep features trained from ImageNet.

1. Related Work

Scene classification aims to provide information about the semantic category or function of a given image. Among the different kinds of scene classification tasks, indoor scene classification is considered one of the most difficult, due to the lack of discriminative features and contexts at the high level [9].

Spatial pyramid representation [7] is a popular method for scene classification tasks. It is a simple and computationally efficient extension of an orderless bag-of-features image representation. However, without a proper high-level feature representation, such schemes often fail to offer sufficient semantic information about a scene.

Object Bank [5] was among the first to propose a high-level image representation for scene classification. It uses a large number of pre-trained generic object detectors to create response maps for high-level visual recognition tasks. The combination of off-the-shelf object detectors and a simple linear prediction model with a sparse-coding scheme achieves superior predictive power over similar linear prediction models trained on conventional representations. However, this also ties the performance of the system to the performance of the chosen object detectors.
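As a reference point, the core of the spatial pyramid representation [7] can be sketched in a few lines: given a grid of quantized local descriptors (visual-word indices), a histogram is computed per cell at each pyramid level and the histograms are concatenated. This is only a minimal sketch, not the authors' implementation; the per-level weighting of the original scheme is omitted, and the vocabulary size and grid are illustrative:

```python
import numpy as np

def spatial_pyramid_histogram(codes, levels=3, vocab=200):
    """Concatenate per-cell visual-word histograms over a 2-D grid of
    quantized descriptors (an H x W array of codeword indices).
    Level l splits the image into 2^l x 2^l cells."""
    h, w = codes.shape
    feats = []
    for l in range(levels):
        n = 2 ** l
        for i in range(n):
            for j in range(n):
                cell = codes[i * h // n:(i + 1) * h // n,
                             j * w // n:(j + 1) * w // n]
                hist = np.bincount(cell.ravel(), minlength=vocab).astype(float)
                feats.append(hist / max(cell.size, 1))  # normalize per cell
    return np.concatenate(feats)

# toy example: a 32x32 grid of codeword indices from a 200-word vocabulary
codes = np.random.randint(0, 200, size=(32, 32))
f = spatial_pyramid_histogram(codes)
# feature length = vocab * (1 + 4 + 16) cells = 200 * 21 = 4200
```

With an orderless bag-of-features, only the level-0 histogram would remain; the finer cells are what inject coarse spatial layout into the representation.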
Recently, Convolutional Neural Networks (CNNs) with flexible capacity have made training on large-scale datasets such as ImageNet [2] possible. A. Krizhevsky et al. [6] trained one of the largest CNNs on subsets of ImageNet and achieved better results than any other state-of-the-art method in 2012. While their CNN system focuses on object recognition, the features it generates can be used for other applications such as scene classification. Two types of improvement have been made on top of their CNN work. The first addresses the problem of generating possible object locations in an image. The selective search method [10] combines the strengths of both exhaustive search and segmentation, and results in a small set of data-driven, class-independent, high-quality locations. Girshick et al. propose the Regions with CNN features (R-CNN) method [3] as a more effective feature generation method. Alternatively, Zhou et al. try to increase the performance of scene classification with CNNs by creating a new scene-centric database [11].

2. Technical Approach

Previous work on Convolutional Neural Networks (CNNs) suggests that they can capture high-level representations of an image in the features of a certain deep layer. The goal of this project is to answer a single question: can CNNs provide a feature representation that extracts the high-level information of an image scene, and thus improve scene classification accuracy?

We choose a CNN pre-trained on the ImageNet dataset (ImageNet-CNN), since ImageNet is a large-scale general object recognition dataset consisting of over 15 million labeled high-resolution images in over 22,000 categories. We use a CNN pre-trained on such a dataset in the hope of reducing the chance of over-fitting to particular scenes. To utilize a pre-trained ImageNet CNN, and for the efficiency of the feature extraction process, we use a popular library: Caffe [4].
To better observe the impact of a good feature representation, we choose a very difficult dataset: the MIT Indoor67 dataset, which includes 15,620 images across 67 indoor scene categories. Object Bank achieves only a 37.6% recognition rate on this dataset. We expect that using deep features extracted from CNNs can significantly improve results on this dataset.