State Space Model for New-Generation Network Alternative to Transformers: A Survey

In the post-deep-learning era, the Transformer architecture has demonstrated strong performance across large pre-trained models and a wide range of downstream tasks. However, its enormous computational demands have deterred many researchers. To reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State Space Model (SSM), a possible replacement for the self-attention-based Transformer, has drawn increasing attention in recent years. In this paper, we give the first comprehensive review of these works and provide experimental comparisons and analysis to better demonstrate the features and advantages of SSMs. Specifically, we first describe the underlying principles in detail to help readers quickly grasp the key ideas of SSMs. We then review existing SSMs and their applications, including natural language processing, computer vision, graphs, multi-modal and multi-media data, point clouds and event streams, time series, and other domains. In addition, we give statistical comparisons and analysis of these models, which we hope will help readers understand the effectiveness of different structures on various tasks. Finally, we propose possible research directions to promote the development of both the theory and the applications of SSMs. More related works will be continuously updated in the following GitHub repository: https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List.
