FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras

In this paper, we develop deep spatio-temporal neural networks to sequentially count vehicles from low-quality videos captured by city cameras (citycams). Citycam videos have low resolution, low frame rate, high occlusion, and large perspective, which makes most existing methods lose their efficacy. To overcome the limitations of existing methods and incorporate the temporal information of traffic video, we design a novel FCN-rLSTM network that jointly estimates vehicle density and vehicle count by connecting fully convolutional neural networks (FCN) with long short-term memory networks (LSTM) in a residual learning fashion. This design leverages the strengths of FCN for pixel-level prediction and the strengths of LSTM for learning complex temporal dynamics. The residual learning connection reformulates vehicle count regression as learning residual functions with reference to the sum of densities in each frame, which significantly accelerates the training of the networks. To preserve feature map resolution, we propose a Hyper-Atrous combination that integrates atrous convolution in the FCN and combines feature maps of different convolution layers. FCN-rLSTM enables a refined feature representation and a novel end-to-end trainable mapping from pixels to vehicle count. We extensively evaluated the proposed method on different counting tasks with three datasets, and the experimental results demonstrate its effectiveness and robustness. In particular, FCN-rLSTM reduces the mean absolute error (MAE) from 5.31 to 4.21 on TRANCOS and from 2.74 to 1.53 on WebCamT. The training process is accelerated by 5 times on average.
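
The residual connection described in the abstract can be illustrated with a minimal sketch. It assumes the density map of each frame is already available as a 2D grid and that the LSTM output is a single scalar residual per frame; both are supplied by hand here, and the names `base_count`, `residual_counts`, and `mean_absolute_error` are hypothetical, not taken from the paper.

```python
from typing import List, Sequence

# A density map is a 2D grid of per-pixel vehicle densities (assumed layout).
DensityMap = List[List[float]]


def base_count(density: DensityMap) -> float:
    """Sum a per-pixel density map into a frame-level base count."""
    return sum(sum(row) for row in density)


def residual_counts(
    densities: Sequence[DensityMap], residuals: Sequence[float]
) -> List[float]:
    """Return one count per frame as (sum of densities) + residual.

    `residuals` stands in for the per-frame LSTM output; in this sketch
    the caller supplies it directly instead of running a network.
    """
    if len(densities) != len(residuals):
        raise ValueError("need exactly one residual per frame")
    return [base_count(d) + r for d, r in zip(densities, residuals)]


def mean_absolute_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth counts."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


# Two toy 2x2 frames with hand-picked residuals (illustrative values only).
frames = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.25, 0.25], [0.25, 0.25]],
]
counts = residual_counts(frames, residuals=[0.5, -0.5])
```

In this formulation the recurrent part only has to learn a correction to the summed density rather than the full count, which is the property the abstract credits for faster training.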
