Dense Prediction with Attentive Feature Aggregation

Aggregating information from features across different layers is essential for dense prediction models. Despite its limited expressiveness, vanilla feature concatenation dominates the choice of aggregation operations. In this paper, we introduce Attentive Feature Aggregation (AFA) to fuse different network layers with more expressive non-linear operations. AFA exploits both spatial and channel attention to compute weighted averages of the layer activations. Inspired by neural volume rendering, we further extend AFA with Scale-Space Rendering (SSR) to perform a late fusion of multi-scale predictions. AFA is applicable to a wide range of existing network designs. Our experiments show consistent and significant improvements on challenging semantic segmentation benchmarks, including Cityscapes and BDD100K at negligible computational and parameter overhead. In particular, AFA improves the performance of the Deep Layer Aggregation (DLA) model by nearly 6% mIoU on Cityscapes. Our experimental analyses show that AFA learns to progressively refine segmentation maps and improve boundary details, leading to new state-of-the-art results on boundary detection benchmarks on NYUDv2 and BSDS500.

[1]  Clayton D. Scott,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  A. Schwing,et al.  Masked-attention Mask Transformer for Universal Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Matti Pietikäinen,et al.  Pixel Difference Networks for Efficient Edge Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Cheng He,et al.  FaPN: Feature-aligned Pyramid Network for Dense Image Prediction , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Anima Anandkumar,et al.  SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers , 2021, NeurIPS.

[6]  Cordelia Schmid,et al.  Segmenter: Transformer for Semantic Segmentation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[8]  Yang Zhao,et al.  Deep High-Resolution Representation Learning for Visual Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Jianping Shi,et al.  Improving Semantic Segmentation via Decoupled Body and Edge Supervision , 2020, ECCV.

[10]  Jinhui Tang,et al.  Feature Pyramid Transformer , 2020, ECCV.

[11]  Xilin Chen,et al.  SegFix: Model-Agnostic Boundary Refinement for Segmentation , 2020, ECCV.

[12]  Zheng Zhang,et al.  Disentangled Non-Local Neural Networks , 2020, ECCV.

[13]  Karan Sapra,et al.  Hierarchical Multi-Scale Attention for Semantic Segmentation , 2020, ArXiv.

[14]  Kuiyuan Yang,et al.  GFF: Gated Fully Fusion for Semantic Segmentation , 2019, AAAI.

[15]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[16]  A. Yuille,et al.  Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation , 2020, ECCV.

[17]  Kuiyuan Yang,et al.  Semantic Flow for Fast and Accurate Scene Parsing , 2020, ECCV.

[18]  Maxwell D. Collins,et al.  Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Xilin Chen,et al.  Object-Contextual Representations for Semantic Segmentation , 2019, ECCV.

[20]  Trevor Darrell,et al.  BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning , 2018, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[22]  Yang Wang,et al.  Region Mutual Information Loss for Semantic Segmentation , 2019, NeurIPS.

[23]  Jun Fu,et al.  Adaptive Context Network for Scene Parsing , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Sanja Fidler,et al.  Gated-SCNN: Gated Shape CNNs for Semantic Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Dani Lischinski,et al.  ZigZagNet: Fusing Top-Down and Bottom-Up Context for Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Lorenzo Porzi,et al.  Seamless Scene Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Ming Yang,et al.  Bi-Directional Cascade Network for Perceptual Edge Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  George Papandreou,et al.  DeeperLab: Single-Shot Image Parser , 2019, ArXiv.

[29]  Shawn D. Newsam,et al.  Improving Semantic Segmentation via Video Propagation and Label Relaxation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Yunchao Wei,et al.  CCNet: Criss-Cross Attention for Semantic Segmentation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Jun Fu,et al.  Dual Attention Network for Scene Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Shengjun Liu,et al.  Learning to predict crisp boundaries , 2018, ECCV.

[33]  In-So Kweon,et al.  CBAM: Convolutional Block Attention Module , 2018, ECCV.

[34]  Gang Peng,et al.  Attention to Refine through Multi-Scales for Semantic Segmentation , 2018, PCM.

[35]  Kun Yu,et al.  DenseASPP for Semantic Segmentation in Street Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[37]  Lorenzo Porzi,et al.  In-place Activated BatchNorm for Memory-Optimized Training of DNNs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Trevor Darrell,et al.  Deep Layer Aggregation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Xiaojuan Qi,et al.  ICNet for Real-Time Semantic Segmentation on High-Resolution Images , 2017, ECCV.

[40]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Peter Kontschieder,et al.  The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[43]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[44]  Thomas A. Funkhouser,et al.  Dilated Residual Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Xiang Bai,et al.  Richer Convolutional Features for Edge Detection , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Tat-Seng Chua,et al.  SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[50]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[52]  Saining Xie,et al.  Holistically-Nested Edge Detection , 2015, International Journal of Computer Vision.

[53]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  M. Ament,et al.  Volume Rendering , 2015 .

[55]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[56]  Sanja Fidler,et al.  The Role of Context for Object Detection and Semantic Segmentation in the Wild , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[58]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  H. Robbins A Stochastic Approximation Method , 1951 .