Following recent progress in image classification and image captioning using deep learning, we developed a novel person retrieval system driven by natural language, which to our knowledge is the first of its kind. Our system employs a state-of-the-art deep learning based natural language object retrieval framework to detect and retrieve people in images. Quantitative experimental results show significant improvement over state-of-the-art methods for generic object retrieval. This line of research offers clear advantages for searching large amounts of video surveillance footage, and it can also be applied in other domains, such as human-robot interaction.

Video surveillance cameras are everywhere: in small stores and apartments for indoor scenarios, and in parking lots and traffic lanes for wide-area observation. With security cameras increasingly ubiquitous, the challenge is not acquiring surveillance data but automatically recognizing what is valuable in the video. Understanding content from video alone, however, is extremely challenging due to factors such as low resolution, deformation, and occlusion. Therefore, it is highly desirable for a system to match objects of interest with a natural language description. Here we employ a state-of-the-art deep learning framework (Hu et al. 2016; Hu, Rohrbach, and Darrell 2016) to retrieve people.

The first challenge of our project is the lack of a dataset for natural language person retrieval. We therefore turn to the Cityscapes dataset (Cordts et al. 2016), a large-scale benchmark for pixel-level and instance-level semantic labeling. Since the focus of our project is person retrieval rather than semantic segmentation, only the segmentation masks belonging to the ‘person’ and ‘rider’ categories are used; each is transformed into a ground truth bounding box based on the minimum and maximum (x, y) coordinates of the mask.
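A minimal sketch of this mask-to-box conversion, assuming each person instance is available as a binary mask (a 2D grid in which nonzero pixels belong to the person). The function name `mask_to_box` and the (x_min, y_min, x_max, y_max) tuple layout are illustrative choices, not taken from the original system.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max), with (x_min, y_min) as the
# top-left corner and (x_max, y_max) as the bottom-right corner.
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tight box around all nonzero pixels, or None for an empty mask."""
    # Column indices of every foreground pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    # Row indices of every row that contains at least one foreground pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 4x5 instance mask used only for illustration.
example_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
print(mask_to_box(example_mask))  # (1, 1, 3, 3)
```

In practice the per-instance masks would first have to be extracted from the dataset's instance annotations; that step is omitted here.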
Specifically, the (xMAX, yMAX) location is treated as the bottom-right corner of the bounding box, while the (xMIN, yMIN) location is treated as the top-left corner. To avoid small persons, only bounding boxes with a size larger than 5,000 are selected for further annotation via Amazon Mechanical Turk (AMT). Given a person inside a bounding box, the AMT workers describe the person and select the attributes that best match the person’s appearance.

The region proposal network (RPN) in Faster R-CNN (Ren et al. 2015) is adopted to generate dozens of bounding boxes, each with a confidence score, that might contain a person. The higher the confidence, the more likely the bounding box is to contain a person (Figure 1A). Since most low-confidence bounding boxes do not include a person’s entire body, the bounding boxes are filtered with a confidence threshold of 0.5. Additionally, the minimum size of the bounding box is set to 5,000 in order to avoid small persons (Figure 1B). Due to the limited number of persons, the dataset is augmented for training by randomly selecting 3 shifted region proposals whose IoU with the ground truth bounding box is larger than 0.5 (Figure 1C). The region proposals without augmentation (Figure 1B) and a description, “An elderly man on the right riding a bike”, are provided as input to the model for person retrieval (green region proposal, Figure 1D, E). The ground truth bounding box is shown in red.

Figure 1. Procedure for generating region proposals and overview of the natural language person retrieval framework.

During the training phase, a positive training instance comprises one region proposal, the spatial configuration, its corresponding description and the label (true)

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)
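The proposal filtering and IoU-based augmentation described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the "size" of a box is interpreted as its area in pixels, the thresholds are applied inclusively for confidence and size and strictly for IoU, and the helper names (`filter_proposals`, `augment`) and example boxes are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Box area; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(scored: List[Tuple[Box, float]],
                     min_conf: float = 0.5,
                     min_size: float = 5000.0) -> List[Box]:
    """Keep proposals that pass the confidence threshold and the minimum size."""
    return [box for box, conf in scored if conf >= min_conf and area(box) >= min_size]


def augment(gt: Box, proposals: List[Box], k: int = 3, min_iou: float = 0.5,
            rng: Optional[random.Random] = None) -> List[Box]:
    """Randomly pick up to k proposals whose IoU with the ground truth exceeds min_iou."""
    rng = rng or random.Random()
    candidates = [p for p in proposals if iou(p, gt) > min_iou]
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical ground truth box and scored proposals, for illustration only.
gt_box: Box = (100, 100, 200, 300)
scored_proposals = [
    ((105, 110, 205, 310), 0.9),
    ((90, 95, 190, 290), 0.8),
    ((110, 120, 210, 320), 0.7),
    ((95, 105, 195, 305), 0.6),
    ((400, 100, 500, 300), 0.95),  # confident, but does not overlap the ground truth
    ((100, 100, 150, 180), 0.9),   # too small (area 4000)
    ((100, 100, 200, 300), 0.3),   # confidence below the threshold
]
kept = filter_proposals(scored_proposals)
extra_positives = augment(gt_box, kept, rng=random.Random(0))
print(len(kept), len(extra_positives))
```

Passing a seeded `random.Random` makes the selection reproducible; how the original system seeds or balances the sampled proposals is not stated in the text.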
[1] Jürgen Schmidhuber, et al. Long Short-Term Memory. Neural Computation, 1997.
[2] Trevor Darrell, et al. Natural Language Object Retrieval. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[3] Aaron C. Courville, et al. Recurrent Batch Normalization. ICLR, 2016.
[4] Sebastian Ramos, et al. The Cityscapes Dataset for Semantic Urban Scene Understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[5] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.
[6] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[7] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2014.
[8] Trevor Darrell, et al. Segmentation from Natural Language Expressions. ECCV, 2016.