AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

Convolutional operations have two limitations: (1) they do not explicitly model where to focus, since the same filter is applied at every position, and (2) they are ill-suited to modeling long-range dependencies, since they operate only on a small neighborhood. Attention operations can alleviate both limitations, but using attention leaves many design choices open, especially when it is applied to videos. Towards a principled way of applying attention to videos, we address the task of spatiotemporal attention cell search. We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both the Kinetics-600 and Moments in Time (MiT) datasets. The discovered attention cells outperform non-local blocks on both datasets and generalize well across different modalities, backbones, and datasets. Inserting our attention cells into I3D-R50 yields state-of-the-art performance on both datasets.
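To make the contrast with convolution concrete, the following is a minimal sketch of a generic scaled dot-product self-attention applied over all spatiotemporal positions of a feature map. It is not the attention cell discovered by the search and not the non-local block as published: it omits learned projections entirely, and the function names `softmax` and `spatiotemporal_self_attention` are hypothetical names chosen for illustration. It only shows two properties the abstract relies on: every position attends to every other position regardless of distance, and the output has the same shape as the input, so such a block can be placed between existing layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatiotemporal_self_attention(features):
    """Generic self-attention over a [T][H][W][C] nested list (illustrative only).

    Each position uses its own feature vector as query, key, and value
    (an assumption made here for brevity; real blocks use learned projections).
    Returns a nested list with the same [T][H][W][C] shape.
    """
    T, H, W = len(features), len(features[0]), len(features[0][0])
    C = len(features[0][0][0])

    # Flatten time and space into one axis so attention spans all positions.
    flat = [features[t][h][w] for t in range(T) for h in range(H) for w in range(W)]
    scale = 1.0 / math.sqrt(C)

    out = []
    for query in flat:
        # Similarity of this position to every position, near or far.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in flat]
        weights = softmax(scores)
        # Weighted sum of all position features.
        attended = [sum(w * v[c] for w, v in zip(weights, flat)) for c in range(C)]
        # Residual connection keeps the original signal and the input shape.
        out.append([x + a for x, a in zip(query, attended)])

    # Restore the [T][H][W][C] layout in the same order used for flattening.
    it = iter(out)
    return [[[next(it) for _ in range(W)] for _ in range(H)] for _ in range(T)]


# Usage: a tiny 2-frame, 2x2 feature map with 2 channels.
video = [[[[float(t + h), float(w)] for w in range(2)] for h in range(2)] for t in range(2)]
result = spatiotemporal_self_attention(video)
```

Because the attention weights are computed from the input itself, the weighting differs per position, whereas a convolution applies one fixed filter everywhere. The cost is quadratic in the number of positions (T·H·W), which is one reason where and how to insert attention is a design choice rather than a default.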
