Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

In this paper, we proposed two modified neural network architectures based on SFANet and SegNet respectively for accurate and efficient crowd counting. Inspired by SFANet, the first model is attached with two novel multi-scale-aware modules, called ASSP and CAN. This model is named M-SFANet. The encoder of M-SFANet is enhanced with ASSP containing parallel atrous convolutional layers with different sampling rates and hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage contextual module called CAN which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on SFANet decoder structure, M-SFANet's decoder has dual paths, for density map generation and attention map generation. The second model is called M-SegNet, which is produced by replacing the bilinear upsampling in SFANet with max unpooling that is used in SegNet. This change provides the faster model while providing competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module in order to not increase the complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We also conduct extensive experiments on five crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that could outperform some of state-of-the-art crowd counting methods. Codes are available at this https URL.

[1]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  R. Venkatesh Babu,et al.  Top-Down Feedback for Crowd Counting Convolutional Neural Network , 2018, AAAI.

[3]  Xiaogang Wang,et al.  Cross-scene crowd counting via deep convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Wei Lin,et al.  Learning From Synthetic Data for Crowd Counting in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Vishal M. Patel,et al.  HA-CCN: Hierarchical Attention-Based Crowd Counting Network , 2019, IEEE Transactions on Image Processing.

[6]  Haizhou Ai,et al.  End-to-end crowd counting via joint learning local and global count , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[7]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Haroon Idrees,et al.  Multi-source Multi-scale Counting in Extremely Dense Crowd Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[11]  Chao Lu,et al.  Dual Path Multi-Scale Fusion Networks with Attention for Crowd Counting , 2019, ArXiv.

[12]  Vishal M. Patel,et al.  Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Liang He,et al.  Adaptive Scenario Discovery for Crowd Counting , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Geoffrey E. Hinton,et al.  Lookahead Optimizer: k steps forward, 1 step back , 2019, NeurIPS.

[16]  Shenghua Gao,et al.  Single-Image Crowd Counting via Multi-Column Convolutional Neural Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Haroon Idrees,et al.  Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds , 2018, ECCV.

[18]  Wangmeng Zuo,et al.  Perspective-Guided Convolution Networks for Crowd Counting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Pascal Fua,et al.  Context-Aware Crowd Counting , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Li Pan,et al.  ADCrowdNet: An Attention-Injective Deformable Convolutional Network for Crowd Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Liang Lin,et al.  Crowd Counting using Deep Recurrent Spatial-Aware Network , 2018, IJCAI.

[23]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[24]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[25]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[26]  Yihong Gong,et al.  Bayesian Loss for Crowd Count Estimation With Point Supervision , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Ling Shao,et al.  Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Ryuzo Okada,et al.  COUNT Forest: CO-Voting Uncertain Number of Targets Using Random Forest for Crowd Density Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Srinivas S. Kruthiventi,et al.  CrowdNet: A Deep Convolutional Network for Dense Crowd Counting , 2016, ACM Multimedia.

[30]  Shiv Surya,et al.  Switching Convolutional Neural Network for Crowd Counting , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Hao Lu,et al.  From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[33]  Alexander Hauptmann,et al.  Learning Spatial Awareness to Improve Crowd Counting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Qijun Chen,et al.  Revisiting Perspective Information for Efficient Crowd Counting , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Nuno Vasconcelos,et al.  Bayesian Poisson regression for crowd counting , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[37]  Fei Su,et al.  Scale Aggregation Network for Accurate and Efficient Crowd Counting , 2018, ECCV.

[38]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[39]  Saturnino Maldonado-Bascón,et al.  Extremely Overlapping Vehicle Counting , 2015, IbPRIA.

[40]  Daniel Oñoro-Rubio,et al.  Towards Perspective-Free Object Counting with Deep Learning , 2016, ECCV.

[41]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[42]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[43]  Yu Wang,et al.  A Deeply-Recursive Convolutional Network For Crowd Counting , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[44]  Guangshuai Gao,et al.  CNN-based Density Estimation and Crowd Counting: A Survey , 2020, ArXiv.

[45]  Andrew Zisserman,et al.  Learning To Count Objects in Images , 2010, NIPS.

[46]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[47]  Hugues Benoit-Cattin,et al.  Semi-supervised Learning for Segmentation Under Semantic Constraint , 2018, MICCAI.

[48]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[49]  Yuhong Li,et al.  CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  K. S. Venkatesh,et al.  People Counting in High Density Crowds from Still Images , 2015, ArXiv.

[51]  Changxin Gao,et al.  Scale Pyramid Network for Crowd Counting , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[52]  Guanbin Li,et al.  Crowd Counting With Deep Structured Scale Integration Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[53]  Wei Lin,et al.  C^3 Framework: An Open-source PyTorch Code for Crowd Counting , 2019, ArXiv.

[54]  Lior Wolf,et al.  Learning to Count with CNN Boosting , 2016, ECCV.

[55]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.