BiLingUNet: Image Segmentation by Modulating Top-Down and Bottom-Up Visual Processing with Referring Expressions

We present BiLingUNet, a state-of-the-art model for image segmentation using referring expressions. BiLingUNet uses language to customize visual filters and outperforms approaches that concatenate a linguistic representation to the visual input. We find that using language to modulate both bottom-up and top-down visual processing works better than just making the top-down processing language-conditional. We argue that common 1x1 language-conditional filters cannot represent relational concepts and experimentally demonstrate that wider filters work better. Our model achieves state-of-the-art performance on four referring expression datasets.

[1]  Trevor Darrell,et al.  Modeling Relationships in Referential Expressions with Compositional Modular Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[4]  Hanwang Zhang,et al.  Learning to Assemble Neural Module Tree Networks for Visual Grounding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Louis-Philippe Morency,et al.  Using Syntax to Ground Referring Expressions in Natural Images , 2018, AAAI.

[6]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[7]  Hwann-Tzong Chen,et al.  See-Through-Text Grouping for Referring Image Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Vicente Ordonez,et al.  ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.

[9]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[11]  Trevor Darrell,et al.  Segmentation from Natural Language Expressions , 2016, ECCV.

[12]  Pablo Arbeláez,et al.  Dynamic Multimodal Instance Segmentation guided by natural language queries , 2018, ECCV.

[13]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[14]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[15]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Hao Chen,et al.  Deep Contextual Networks for Neuronal Structure Segmentation , 2016, AAAI.

[17]  Xiaojuan Qi,et al.  Referring Image Segmentation via Recurrent Refinement Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Hugo Jair Escalante,et al.  The segmented and annotated IAPR TC-12 benchmark , 2010, Comput. Vis. Image Underst..

[19]  Dumitru Erhan,et al.  Scalable Object Detection Using Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[21]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[22]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[23]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Andrew Bennett,et al.  Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction , 2018, EMNLP.

[25]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[26]  Zhuowen Tu,et al.  Deeply-Supervised Nets , 2014, AISTATS.

[27]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[29]  Hongliang Li,et al.  Key-Word-Aware Network for Referring Expression Image Segmentation , 2018, ECCV.

[30]  Licheng Yu,et al.  MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[32]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[34]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[36]  Lianli Gao,et al.  Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Yang Wang,et al.  Cross-Modal Self-Attention Network for Referring Image Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[41]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Luc De Raedt,et al.  Learning from Implicit Information in Natural Language Instructions for Robotic Manipulations , 2019, ArXiv.

[43]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[44]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Chenxi Liu,et al.  Recurrent Multimodal Interaction for Referring Image Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[46]  Hao Chen,et al.  3D deeply supervised network for automated segmentation of volumetric medical images , 2017, Medical Image Anal..

[47]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[48]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[49]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[50]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[51]  Licheng Yu,et al.  Modeling Context in Referring Expressions , 2016, ECCV.

[52]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).