SwinUNet3D - A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers

Traffic forecasting is an important element of mobility management and a key driver of the logistics industry. Over the years, much work has been done on traffic forecasting, using both time-series and spatiotemporal dynamic forecasting methods. In this paper, we explore the use of vision transformers in a UNet setting. We completely remove all convolution-based building blocks from the UNet, using 3D shifted window transformers in both the encoder and decoder branches. In addition, we experiment with feature mixing just before patch encoding to control the inter-relationships of the features while avoiding contraction of the depth dimension of our spatiotemporal input. The proposed network is tested on the data provided by the Traffic Map Movie Forecasting Challenge 2021 (Traffic4cast2021), held in the competition track of Neural Information Processing Systems (NeurIPS). The Traffic4cast2021 task is to predict one hour (6 frames) of traffic conditions (volume and average speed) from one hour of given traffic state (12 frames, each averaged over a 5-minute time span). Source code is available online at https://github.com/bojesomo/Traffic4Cast2021-SwinUNet3D.
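The core operation behind a 3D shifted window transformer is partitioning a spatiotemporal volume into non-overlapping windows, with a cyclic shift applied on alternating blocks so that attention crosses window boundaries. The sketch below illustrates that partitioning step on a toy input shaped like the Traffic4cast frames (time, height, width, channels); the function name, window size, and shift values are illustrative choices for this example, not the authors' actual implementation.

```python
import numpy as np

def window_partition_3d(x, window_size, shift=(0, 0, 0)):
    """Split a (T, H, W, C) volume into non-overlapping 3D windows.

    A non-zero `shift` applies the cyclic shift used by shifted-window
    attention before partitioning. Each (T, H, W) dimension must be
    divisible by the corresponding window size.
    """
    if any(shift):
        # Cyclic shift: tokens rolled off one edge re-enter on the other,
        # so the shifted windows mix content from neighbouring windows.
        x = np.roll(x, shift=[-s for s in shift], axis=(0, 1, 2))
    T, H, W, C = x.shape
    wt, wh, ww = window_size
    x = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    # Collect each window's tokens: (num_windows, wt*wh*ww, C).
    return x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt * wh * ww, C)

# Toy spatiotemporal input: 12 frames of a 4x8 grid with 8 channels.
x = np.random.rand(12, 4, 8, 8)
wins = window_partition_3d(x, window_size=(2, 2, 4))
print(wins.shape)  # (24, 16, 8): 24 windows of 16 tokens each
```

Self-attention is then computed independently inside each window, which keeps the cost linear in the number of windows rather than quadratic in the full token count; the shifted variant (e.g. `shift=(1, 1, 2)`) produces the same window shapes but with boundary-crossing content.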
