ST-CNN: Spatial-Temporal Convolutional Neural Network for crowd counting in videos

Crowd counting and density map estimation from videos are challenging because of severe occlusions, scene perspective distortions and diverse crowd distributions. Conventional deep learning methods for crowd counting process each video frame independently, ignoring the intrinsic temporal correlation among neighboring frames, so their performance falls short of what real-world applications require. To overcome this shortcoming, a new end-to-end deep architecture named Spatial-Temporal Convolutional Neural Network (ST-CNN) is proposed, which unifies a 2D convolutional neural network (C2D) and a 3D convolutional neural network (C3D) to learn spatial-temporal features in the same framework. On top of that, a merging scheme is applied to the resulting density maps, so that spatial and temporal information are exploited simultaneously for the crowd counting task. Experimental results on two benchmark data sets, the Mall dataset and the WorldExpo'10 dataset, show that our ST-CNN outperforms the state-of-the-art models in terms of mean absolute error (MAE) and mean squared error (MSE).
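
As a rough illustration of how merged density maps lead to counts and to the two reported metrics, the following NumPy sketch may help. It is not the merging scheme of ST-CNN, which the abstract does not specify: the weighted average in `merge_density_maps`, the `weight` parameter, and all function names are assumptions made for this example. The only standard parts are that a crowd count is obtained by summing a density map, and that MAE and MSE are averaged over per-frame count errors.

```python
import numpy as np


def merge_density_maps(spatial_map, temporal_map, weight=0.5):
    """Hypothetical merge: a convex combination of two density maps.

    The actual merging scheme of ST-CNN is not described in the abstract;
    a weighted average is assumed here purely for illustration.
    """
    return weight * spatial_map + (1.0 - weight) * temporal_map


def count_from_density(density_map):
    """The estimated crowd count is the sum (integral) of the density map."""
    return float(density_map.sum())


def mae_mse(predicted_counts, true_counts):
    """Mean absolute error and mean squared error over per-frame counts."""
    predicted = np.asarray(predicted_counts, dtype=float)
    truth = np.asarray(true_counts, dtype=float)
    errors = predicted - truth
    return float(np.abs(errors).mean()), float((errors ** 2).mean())


# Toy data: two 4x4 density maps standing in for the two branch outputs.
spatial = np.full((4, 4), 0.5)    # sums to 8 people
temporal = np.full((4, 4), 0.75)  # sums to 12 people

merged = merge_density_maps(spatial, temporal)
count = count_from_density(merged)

# Two frames: the merged estimate above, plus a made-up second prediction.
mae, mse = mae_mse([count, 21.0], [12.0, 20.0])
print(count, mae, mse)
```

Some crowd counting papers report the square root of the mean squared error under the name MSE; the sketch computes the plain mean of squared errors.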
