Deep visual feature learning for vehicle detection, recognition and re-identification

As the number of motor vehicles in transportation systems keeps increasing, intelligent video surveillance and management, an important application area of artificial intelligence, is becoming ever more necessary. Vehicle-related problems are being widely explored and their solutions applied in practice. Among the various techniques, computer vision and machine learning algorithms have been the most popular, since a vast amount of video and image surveillance data is now available for research. In this thesis, vision-based approaches for vehicle detection, recognition, and re-identification are extensively investigated. Moreover, to address different challenges, several novel methods are proposed to overcome weaknesses of previous works and achieve compelling performance.

Deep visual feature learning has been widely researched over the past five years and has made substantial progress in many applications, including image classification, image retrieval, object detection, image segmentation and image generation. Compared with traditional machine learning methods, which consist of hand-crafted feature extraction and shallow model learning, deep neural networks can learn hierarchical feature representations from low-level to high-level features and thereby achieve more robust recognition. For some specific tasks, researchers prefer to embed feature learning and classification/regression methods into end-to-end models, which can benefit both accuracy and efficiency. This thesis mainly uses deep models to study its research problems.

Vehicle detection is the most fundamental task in intelligent video surveillance but faces many challenges, such as severe illumination and viewpoint variations, occlusions and scale variation. Moreover, learning vehicles' diverse attributes is also an interesting and valuable problem. To address these tasks and their difficulties, a fast framework of Detection and Annotation for Vehicles (DAVE) is presented, which effectively combines vehicle detection and attribute annotation. DAVE consists of two convolutional neural networks (CNNs): a fast vehicle proposal network (FVPN) for extracting vehicle-like objects and an attributes learning network (ALN) that verifies each proposal and infers each vehicle's pose, color and type simultaneously. These two nets are jointly optimized so that the abundant latent knowledge learned from the ALN can be exploited to guide FVPN training. Once the model is trained, it can achieve efficient vehicle detection and annotation for real-world traffic surveillance data.

The second research problem of the thesis focuses on vehicle re-identification (re-ID). Vehicle re-ID aims to identify a target vehicle across different cameras with non-overlapping views. It has received far less attention in the computer vision community than the prevalent person re-ID problem. Possible reasons for this slow progress are the lack of appropriate research data and the special 3D structure of a vehicle. Previous works have generally focused on some specific views (e.g. front), but these methods are less effective in realistic scenarios where vehicles usually appear to cameras from arbitrary viewpoints. In this thesis, I focus on the uncertainty of vehicle viewpoint in re-ID, proposing four different approaches to address the multi-view vehicle re-ID problem:
(1) A Spatially Concatenated ConvNet (SCCN) with an encoder-decoder architecture is proposed to learn transformations across different viewpoints of a vehicle and then spatially concatenate all the feature maps, which are fused into a multi-view feature representation.
(2) A Cross-View Generative Adversarial Network (XVGAN) is designed to take an input image's feature as a conditional embedding to effectively infer cross-view images. The features of the inferred and original images are combined to learn distance metrics for re-ID.
(3) The advantages of a bi-directional Long Short-Term Memory (LSTM) loop for modeling transformations across the continuous view variation of a vehicle are investigated.
(4) A Viewpoint-aware Attentive Multi-view Inference (VAMI) model is proposed, adopting a viewpoint-aware attention model to select core regions at different viewpoints and then performing multi-view feature inference with an adversarial training architecture.
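
As a rough illustration of the re-ID step in approach (2), where the features of the original and inferred images are combined before distances are compared, the following minimal Python sketch may help. It is not the thesis's implementation: the function names are hypothetical, the feature vectors are made-up toy values, concatenation is only one assumed way of combining the two features, and a plain Euclidean distance stands in for the learned distance metric.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(original_feat, inferred_feat):
    """Assumed fusion: concatenate the normalized features of the original
    image and of the inferred cross-view image into one descriptor."""
    return l2_normalize(original_feat) + l2_normalize(inferred_feat)


def euclidean(a, b):
    """Plain Euclidean distance, used here in place of a learned metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query, gallery):
    """Return gallery identities sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda vid: euclidean(query, gallery[vid]))


# Toy data: 2-D "original" and 2-D "inferred" features per image (illustrative only).
query = fuse([1.0, 0.0], [0.0, 1.0])
gallery = {
    "vehicle_a": fuse([0.9, 0.1], [0.1, 0.9]),
    "vehicle_b": fuse([0.0, 1.0], [1.0, 0.0]),
}
ranking = rank_gallery(query, gallery)
print(ranking)
```

In this toy setting the gallery entry whose fused descriptor lies closest to the query is ranked first; in a real system the features would come from the trained networks and the distance would be learned from data.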
