Scale Pyramid Network for Crowd Counting

Crowd counting is a concerned yet challenging task in computer vision. The difficulty is particularly pronounced by scale variations in crowd images. Most state-of-art approaches tackle the multi-scale problem by adopting multi-column CNN architectures where different columns are designed with different filter sizes to adapt to variable pedestrian/object sizes. However, the structure is bloated and inefficient, and it is infeasible to adopt multiple deep columns due to the huge resource cost. We instead propose a Scale Pyramid Network (SPN) which adopts a shared single deep column structure and extracts multi-scale information in high layers by Scale Pyramid Module. In Scale Pyramid Module, we specifically employ different rates of dilated convolutions in parallel instead of traditional convolutions with different sizes. Compared to other methods of coping with scale issues, our single column structure with Scale Pyramid Module can get more accurate estimation with simpler structure and less complexity of training. And our Scale Pyramid Module can be easily applied to a deep network. Experimental results on four datasets show that our method achieves state-of-the-art performance. On ShanghaiTech Part_A dataset which is challenging for its highly congested scenes and scale variation, we achieve 9.5% lower MAE and 13.5% lower MSE than the previous state-of-the-art method. We also extend our model on TRANCOS vehicle counting dataset and significantly achieve 5.9% lower GAME(0), 10% lower GAME(1), 24.5% lower GAME(2), 38.7% lower GAME(3) than the previous state-of-the-art method. The experimental results prove the robustness of our model for crowd counting, especially with scale variations.

[1]  Yuhong Li,et al.  CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Jonathan Ventura,et al.  An Aggregated Multicolumn Dilated Convolution Network for Perspective-Free Counting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[3]  Sridha Sridharan,et al.  Crowd Counting Using Multiple Local Features , 2009, 2009 Digital Image Computing: Techniques and Applications.

[4]  Vishal M. Patel,et al.  A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation , 2017, Pattern Recognit. Lett..

[5]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Antoni B. Chan,et al.  Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks—Counting, Detection, and Tracking , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[7]  Lu Zhang,et al.  Crowd Counting via Scale-Adaptive Convolutional Neural Network , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[8]  José M. F. Moura,et al.  FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Andrew Zisserman,et al.  Learning To Count Objects in Images , 2010, NIPS.

[10]  Noel E. O'Connor,et al.  Fully Convolutional Crowd Counting on Highly Congested Scenes , 2016, VISIGRAPP.

[11]  Vishal M. Patel,et al.  Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Joost van de Weijer,et al.  Leveraging Unlabeled Data for Crowd Counting by Learning to Rank , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Ullrich Köthe,et al.  Learning to count with regression forest and structured labels , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[14]  Haroon Idrees,et al.  Multi-source Multi-scale Counting in Extremely Dense Crowd Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[16]  Daniel Oñoro-Rubio,et al.  Towards Perspective-Free Object Counting with Deep Learning , 2016, ECCV.

[17]  Lior Wolf,et al.  Learning to Count with CNN Boosting , 2016, ECCV.

[18]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[19]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  R. Venkatesh Babu,et al.  Divide and Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Deyu Meng,et al.  DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Bingbing Ni,et al.  Crowd Counting via Adversarial Cross-Scale Consistency Pursuit , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[24]  Shenghua Gao,et al.  Single-Image Crowd Counting via Multi-Column Convolutional Neural Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Shiv Surya,et al.  Switching Convolutional Neural Network for Crowd Counting , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Hieu Le,et al.  Iterative Crowd Counting , 2018, ECCV.

[27]  Vishal M. Patel,et al.  CNN-Based cascaded multi-task learning of high-level prior and density estimation for crowd counting , 2017, 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[28]  Xiaogang Wang,et al.  Cross-scene crowd counting via deep convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Shaogang Gong,et al.  Feature Mining for Localised Crowd Counting , 2012, BMVC.

[30]  Guoping Qiu,et al.  Crowd density estimation based on rich features and random projection forest , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[31]  Srinivas S. Kruthiventi,et al.  CrowdNet: A Deep Convolutional Network for Dense Crowd Counting , 2016, ACM Multimedia.

[32]  Ryuzo Okada,et al.  COUNT Forest: CO-Voting Uncertain Number of Targets Using Random Forest for Crowd Density Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Yi Wang,et al.  Fast visual object counting via example-based density estimation , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[34]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[35]  Nuno Vasconcelos,et al.  Privacy preserving crowd monitoring: Counting people without people models or tracking , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Haidi Ibrahim,et al.  Recent survey on crowd density estimation and counting for visual surveillance , 2015, Eng. Appl. Artif. Intell..

[37]  Shaogang Gong,et al.  Cumulative Attribute Space for Age and Crowd Density Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.