Paper Information - SMCA-CNN: Learning a Semantic Mask and Cross-Scale Adaptive Feature for Robust Crowd Counting

Density-based crowd counting methods built on deep convolutional neural networks (CNNs) have achieved state-of-the-art results on challenging datasets. Experimental results show, however, that these methods suffer from two problems: 1) Background interference: spurious density values are estimated in background regions, which degrades counting accuracy. 2) Cross-scale variation: the scale of human heads varies greatly across crowd images, which lowers the quality of the density maps. In this study, we address both problems to improve counting accuracy. For the former, we propose a lightweight semantic mask module (SMM) that learns the semantic masks of crowd images, supervised by ground-truth semantic masks generated from the ground-truth density maps. For the latter, we propose a span architecture (SA) that captures large scale variation in crowd images by building cross-scale features from the pyramidal structure of a deep CNN. To adaptively leverage the salient cross-scale features, we design a Cross-scale Adaptive Module (CAM). Integrating these elements, we develop SMCA-CNN, an end-to-end trainable, single-column crowd counting model trained with a joint loss function consisting of a cross-entropy loss and a Euclidean loss. Extensive experiments on five challenging datasets demonstrate the effectiveness of SMCA-CNN. Compared with the previous state-of-the-art methods, our model achieves 17.1% lower MAE on the UCF_CC_50 dataset and 23.6% lower MAE on the newly published UCF-QNRF dataset.
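The supervision scheme described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract only states that ground-truth masks are generated from ground-truth density maps and that the joint loss combines a cross-entropy loss with a Euclidean loss. The thresholding rule, the `weight` parameter, the 0.5 scaling, the mean reduction, and all function names below are assumptions made for illustration.

```python
import math

def mask_from_density(density, threshold=0.0):
    # Assumption: a pixel is foreground when its density exceeds a threshold.
    # The abstract does not specify how the mask is derived.
    return [[1 if v > threshold else 0 for v in row] for row in density]

def euclidean_loss(pred, target):
    # Half the sum of squared per-pixel differences between two density maps.
    # The 0.5 factor is a common convention, assumed here.
    return 0.5 * sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )

def cross_entropy_loss(prob, mask, eps=1e-7):
    # Mean binary cross-entropy between predicted foreground probabilities
    # and a 0/1 mask; probabilities are clipped to avoid log(0).
    total, count = 0.0, 0
    for prob_row, mask_row in zip(prob, mask):
        for p, m in zip(prob_row, mask_row):
            p = min(max(p, eps), 1.0 - eps)
            total -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
            count += 1
    return total / count

def joint_loss(pred_density, gt_density, pred_mask_prob, weight=1.0):
    # Hypothetical combination: Euclidean term plus a weighted cross-entropy
    # term. The actual weighting used by SMCA-CNN is not given in the abstract.
    gt_mask = mask_from_density(gt_density)
    return (euclidean_loss(pred_density, gt_density)
            + weight * cross_entropy_loss(pred_mask_prob, gt_mask))

def mae(pred_counts, gt_counts):
    # Mean absolute error over per-image counts, the metric quoted above.
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

if __name__ == "__main__":
    gt = [[0.0, 0.5], [0.5, 0.0]]        # toy ground-truth density map
    pred = [[0.0, 0.5], [0.5, 1.0]]      # toy predicted density map
    prob = [[0.5, 0.5], [0.5, 0.5]]      # toy predicted mask probabilities
    print(mask_from_density(gt))
    print(joint_loss(pred, gt, prob))
    print(mae([10, 20], [12, 17]))
```

In this toy setup the predicted count of an image would be the sum of its density map, and the reported MAE would be computed over those per-image counts.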
