Point in, Box Out: Beyond Counting Persons in Crowds

Modern crowd counting methods usually employ deep neural networks (DNN) to estimate crowd counts via density regression. Despite their significant improvements, the regression-based methods are incapable of providing the detection of individuals in crowds. The detection-based methods, on the other hand, have not been largely explored in recent trends of crowd counting due to the needs for expensive bounding box annotations. In this work, we instead propose a new deep detection network with only point supervision required. It can simultaneously detect the size and location of human heads and count them in crowds. We first mine useful person size information from point-level annotations and initialize the pseudo ground truth bounding boxes. An online updating scheme is introduced to refine the pseudo ground truth during training; while a locally-constrained regression loss is designed to provide additional constraints on the size of the predicted boxes in a local neighborhood. In the end, we propose a curriculum learning strategy to train the network from images of relatively accurate and easy pseudo ground truth first. Extensive experiments are conducted in both detection and counting tasks on several standard benchmarks, e.g. ShanghaiTech, UCF_CC_50, WiderFace, and TRANCOS datasets, and the results show the superiority of our method over the state-of-the-art.

[1]  Pietro Perona,et al.  Strong supervision from weak annotation: Interactive training of deformable part models , 2011, 2011 International Conference on Computer Vision.

[2]  Hieu Le,et al.  Iterative Crowd Counting , 2018, ECCV.

[3]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Peiyun Hu,et al.  Finding Tiny Faces , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Paul A. Viola,et al.  Detecting Pedestrians Using Patterns of Motion and Appearance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6]  Qi Tian,et al.  Zigzag Learning for Weakly Supervised Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[8]  Fei-Fei Li,et al.  What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[9]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Pietro Perona,et al.  Multiclass recognition and part localization with humans in the loop , 2011, 2011 International Conference on Computer Vision.

[11]  Bernard Ghanem,et al.  Finding Tiny Faces in the Wild with Generative Adversarial Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Nuno Vasconcelos,et al.  Bayesian Poisson regression for crowd counting , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[13]  Xiaogang Wang,et al.  Cross-scene crowd counting via deep convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Joost van de Weijer,et al.  Leveraging Unlabeled Data for Crowd Counting by Learning to Rank , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Shengcai Liao,et al.  Person re-identification by Local Maximal Occurrence representation and metric learning , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Ivan Laptev,et al.  Density-aware person detection and tracking in crowds , 2011, ICCV.

[18]  Serge J. Belongie,et al.  Counting Crowded Moving Objects , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[19]  Daniel Oñoro-Rubio,et al.  Towards Perspective-Free Object Counting with Deep Learning , 2016, ECCV.

[20]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[21]  Bo Han,et al.  TouchCut: Fast image and video segmentation using single-touch interaction , 2014, Comput. Vis. Image Underst..

[22]  Lu Zhang,et al.  Crowd Counting via Scale-Adaptive Convolutional Neural Network , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[23]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[24]  Frank Keller,et al.  Extreme Clicking for Efficient Object Annotation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Yuhong Li,et al.  CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Andrew Zisserman,et al.  Learning To Count Objects in Images , 2010, NIPS.

[27]  Mark W. Schmidt,et al.  Where are the Blobs: Counting by Localization with Point Supervision , 2018, ECCV.

[28]  Larry S. Davis,et al.  SSH: Single Stage Headless Face Detector , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Ben Taskar,et al.  MODEC: Multimodal Decomposable Models for Human Pose Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Vishal M. Patel,et al.  Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[32]  Roberto Cipolla,et al.  Unsupervised Bayesian Detection of Independent Motion in Crowds , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[33]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[34]  Haroon Idrees,et al.  Multi-source Multi-scale Counting in Extremely Dense Crowd Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Deva Ramanan,et al.  Learning to parse images of articulated bodies , 2006, NIPS.

[36]  Nuno Vasconcelos,et al.  Privacy preserving crowd monitoring: Counting people without people models or tracking , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Andrew Y. Ng,et al.  End-to-End People Detection in Crowded Scenes , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Saturnino Maldonado-Bascón,et al.  Extremely Overlapping Vehicle Counting , 2015, IbPRIA.

[39]  Qijun Chen,et al.  Revisiting Perspective Information for Efficient Crowd Counting , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Huaizu Jiang,et al.  Face Detection with the Faster R-CNN , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[41]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[42]  Haroon Idrees,et al.  Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds , 2018, ECCV.

[43]  Miaojing Shi,et al.  Weakly Supervised Object Localization Using Size Estimates , 2016, ECCV.

[44]  Deyu Meng,et al.  DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Shuo Yang,et al.  WIDER FACE: A Face Detection Benchmark , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Shenghua Gao,et al.  Single-Image Crowd Counting via Multi-Column Convolutional Neural Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Shiv Surya,et al.  Switching Convolutional Neural Network for Crowd Counting , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).