Christoph Feichtenhofer | Yanghao Li | Jitendra Malik | Bo Xiong | Karttikeya Mangalam | Haoqi Fan | Zhicheng Yan
[1] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Susanne Westphal,et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[3] Lijuan Wang,et al. End-to-End Human Pose and Mesh Reconstruction with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Chunhua Shen,et al. End-to-End Video Instance Segmentation with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Cees Snoek,et al. VideoLSTM convolves, attends and flows for action recognition , 2016, Comput. Vis. Image Underst.
[6] J. Koenderink. The structure of images , 2004, Biological Cybernetics.
[7] Andrew Zisserman,et al. A Short Note about Kinetics-600 , 2018, ArXiv.
[8] Quoc V. Le,et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , 2019, ICML.
[9] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[10] Jian Sun,et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[11] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[12] Matthieu Cord,et al. Training data-efficient image transformers & distillation through attention , 2020, ICML.
[13] Myle Ott,et al. Understanding Back-Translation at Scale , 2018, EMNLP.
[14] Wankou Yang,et al. TransPose: Towards Explainable Human Pose Estimation by Transformer , 2020, ArXiv.
[15] Pichao Wang,et al. TransReID: Transformer-based Object Re-Identification , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Mark Chen,et al. Generative Pretraining From Pixels , 2020, ICML.
[17] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[18] Wei Liu,et al. CPTR: Full Transformer Network for Image Captioning , 2021, ArXiv.
[19] Yi Yang,et al. Random Erasing Data Augmentation , 2017, AAAI.
[20] Jian Sun,et al. Identity Mappings in Deep Residual Networks , 2016, ECCV.
[21] Edward H. Adelson,et al. The Laplacian Pyramid as a Compact Image Code , 1983, IEEE Trans. Commun.
[22] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.
[23] Quoc V. Le,et al. Attention Augmented Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[24] Ronghang Hu,et al. Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer , 2021, ArXiv.
[25] Aaron B. Adcock,et al. PyTorchVideo: A Deep Learning Library for Video Understanding , 2021, ACM Multimedia.
[26] Cordelia Schmid,et al. Actor-Centric Relation Network , 2018, ECCV.
[27] David Rolnick,et al. How to Start Training: The Effect of Initialization and Architecture , 2018, NeurIPS.
[28] Ivan Laptev,et al. Training Vision Transformers for Image Retrieval , 2021, ArXiv.
[29] Torsten Hoefler,et al. Augment Your Batch: Improving Generalization Through Instance Repetition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[31] Wei Wu,et al. STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[32] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[33] Chuang Gan,et al. TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[34] Cordelia Schmid,et al. ViViT: A Video Vision Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[35] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[36] Stephen Lin,et al. Local Relation Networks for Image Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[37] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[38] D. Hubel,et al. Receptive fields of optic nerve fibres in the spider monkey , 1960, The Journal of physiology.
[39] Kai Zhao,et al. Res2Net: A New Multi-Scale Backbone Architecture , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[40] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.
[41] Kilian Q. Weinberger,et al. Deep Networks with Stochastic Depth , 2016, ECCV.
[42] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[43] Abhinav Gupta,et al. Videos as Space-Time Region Graphs , 2018, ECCV.
[44] Ilya Sutskever,et al. Generating Long Sequences with Sparse Transformers , 2019, ArXiv.
[45] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[46] Bolei Zhou,et al. Temporal Relational Reasoning in Videos , 2017, ECCV.
[47] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[48] Georg Heigold,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.
[49] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[51] Omri Bar,et al. Video Transformer Network , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).
[52] Rogério Schmidt Feris,et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition , 2018, ICLR.
[53] Vladlen Koltun,et al. Exploring Self-Attention for Image Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[54] Chongruo Wu,et al. ResNeSt: Split-Attention Networks , 2020, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[55] Han Fang,et al. Linformer: Self-Attention with Linear Complexity , 2020, ArXiv.
[56] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[57] Ross B. Girshick,et al. Fast and Accurate Model Scaling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[58] MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers , 2020, ArXiv.
[59] Frank Hutter,et al. SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.
[60] Liu Yang,et al. Sparse Sinkhorn Attention , 2020, ICML.
[61] Ashish Vaswani,et al. Stand-Alone Self-Attention in Vision Models , 2019, NeurIPS.
[62] Arnold W. M. Smeulders,et al. Timeception for Complex Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[63] Yichen Wei,et al. Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[64] Frank Hutter,et al. Fixing Weight Decay Regularization in Adam , 2017, ArXiv.
[65] Shuicheng Yan,et al. Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks With Octave Convolution , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[66] Wengang Zhou,et al. Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection for Autonomous Driving , 2020, IEEE Transactions on Circuits and Systems for Video Technology.
[67] Ralph R. Martin,et al. PCT: Point cloud transformer , 2020, Computational Visual Media.
[68] Heng Wang,et al. Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[69] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[70] Bo Zhang,et al. Do We Really Need Explicit Position Encodings for Vision Transformers? , 2021, ArXiv.
[71] Quoc V. Le,et al. Randaugment: Practical automated data augmentation with a reduced search space , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[72] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[73] Suha Kwak,et al. MotionSqueeze: Neural Motion Feature Learning for Video Understanding , 2020, ECCV.
[74] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.
[75] Guokun Lai,et al. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing , 2020, NeurIPS.
[76] Georg Heigold,et al. Object-Centric Learning with Slot Attention , 2020, NeurIPS.
[77] Jan Kautz,et al. Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[78] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[79] Shuicheng Yan,et al. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet , 2021, ArXiv.
[80] Hongyi Zhang,et al. mixup: Beyond Empirical Risk Minimization , 2017, ICLR.
[81] Kurt Keutzer,et al. Visual Transformers: Token-based Image Representation and Processing for Computer Vision , 2020, ArXiv.
[82] Karan Desai,et al. VirTex: Learning Visual Representations from Textual Annotations , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[83] Nitish Srivastava,et al. Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.
[84] Vishal M. Patel,et al. Medical Transformer: Gated Axial-Attention for Medical Image Segmentation , 2021, MICCAI.
[85] Chen Sun,et al. Multi-modal Transformer for Video Retrieval , 2020, ECCV.
[86] Eric Tzeng,et al. Toward Transformer-Based Object Detection , 2020, ArXiv.
[87] Bin Kang,et al. TEA: Temporal Excitation and Aggregation for Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[88] Christoph Feichtenhofer,et al. X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[89] A. Rosenfeld,et al. Edge and Curve Detection for Visual Scene Analysis , 1971, IEEE Transactions on Computers.
[90] Gedas Bertasius,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[91] Lukasz Kaiser,et al. Rethinking Attention with Performers , 2020, ArXiv.
[92] Andrew Zisserman,et al. A Short Note on the Kinetics-700 Human Action Dataset , 2019, ArXiv.
[93] Kaiming He,et al. Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[94] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[95] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.
[96] Song Han,et al. Temporal Shift Module for Efficient Video Understanding , 2018, ArXiv.
[97] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[98] Andrew Zisserman,et al. Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[99] Rui Li,et al. Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation , 2020, ArXiv.
[100] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.
[101] Zhou Yu,et al. Multimodal Transformer With Multi-View Visual Representation for Image Captioning , 2019, IEEE Transactions on Circuits and Systems for Video Technology.
[102] Irwan Bello. LambdaNetworks: Modeling Long-Range Interactions Without Attention , 2021, ICLR.
[103] Kevin Gimpel,et al. Gaussian Error Linear Units (GELUs) , 2016.
[104] Kunihiko Fukushima,et al. Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition , 1982.
[105] Cordelia Schmid,et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[106] Yan Wang,et al. SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation , 2021, ArXiv.
[107] Yan Wang,et al. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation , 2021, ArXiv.
[108] Doyen Sahoo,et al. Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems , 2019, ACL.
[109] Seong Joon Oh,et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[110] Ilya Sutskever,et al. Zero-Shot Text-to-Image Generation , 2021, ICML.
[111] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[112] Kaiming He,et al. Designing Network Design Spaces , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[113] Chen Sun,et al. Rethinking Spatiotemporal Feature Learning For Video Understanding , 2017, ArXiv.
[114] Lawrence D. Jackel,et al. Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.
[115] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[116] Lukasz Kaiser,et al. Reformer: The Efficient Transformer , 2020, ICLR.