Investigating transformers in the decomposition of polygonal shapes as point collections

Transformers can generate predictions in two ways: (1) auto-regressively, conditioning each sequence element on the previously generated ones, or (2) in parallel, producing the entire output sequence at once. While research has mostly explored this difference on sequential tasks in NLP, we study it on visual set prediction tasks, and in particular on polygonal shapes in images, because polygons are representative of numerous types of objects, such as buildings or obstacles for aerial vehicles. Polygons are challenging for deep learning architectures because they can consist of a varying cardinality of points. We provide evidence for the importance of natural point orderings for Transformers, and show the benefit of decomposing complex polygons into collections of points that are predicted auto-regressively.
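To make the two decoding modes concrete, below is a minimal PyTorch sketch, not the paper's implementation: the model sizes, the zero tensor standing in for a learned start embedding, the random tensor standing in for encoded image features, and the fixed number of vertex queries are all illustrative assumptions. It contrasts auto-regressive vertex-by-vertex decoding with DETR-style parallel decoding from a set of learned queries.

```python
# Sketch (not the paper's code) of the two decoding modes for polygon vertices.
import torch
import torch.nn as nn

D_MODEL, N_POINTS = 64, 8  # hypothetical embedding size / max vertex count

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
    num_layers=2,
)
to_xy = nn.Linear(D_MODEL, 2)    # map each decoder state to an (x, y) vertex
from_xy = nn.Linear(2, D_MODEL)  # embed a predicted vertex back into model space

image_memory = torch.randn(1, 49, D_MODEL)  # stand-in for encoded image features

# 1) Auto-regressive: each vertex is conditioned on all previously emitted ones.
seq = torch.zeros(1, 1, D_MODEL)  # a learned <start> embedding in practice
points_ar = []
for _ in range(N_POINTS):
    causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
    h = decoder(seq, image_memory, tgt_mask=causal)
    xy = to_xy(h[:, -1])          # predict only the next vertex
    points_ar.append(xy)
    seq = torch.cat([seq, from_xy(xy).unsqueeze(1)], dim=1)

# 2) Parallel: a fixed set of learned queries attends to the image and all
#    vertices are produced in a single forward pass (DETR-style).
queries = torch.randn(1, N_POINTS, D_MODEL)  # learned vertex queries in practice
points_par = to_xy(decoder(queries, image_memory))

print(torch.stack(points_ar, dim=1).shape, points_par.shape)  # both (1, 8, 2)
```

The auto-regressive loop needs N_POINTS sequential decoder calls but can naturally exploit an ordering of the points, whereas the parallel variant produces the whole set in one pass and must resolve the ordering through a set-based matching loss.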
