Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos

Deep video recognition is more computationally expensive than image recognition, especially on large-scale datasets like Kinetics [11]. Training scalability is therefore essential for handling large numbers of videos. In this paper, we study the factors that impact the training scalability of video networks. We identify three bottlenecks: data loading (data movement from disk to GPU), communication (data movement over the network), and computation (FLOPs). We propose three design guidelines to improve scalability: (1) fewer FLOPs and hardware-friendly operators to increase computation efficiency; (2) fewer input frames to reduce data movement and increase data-loading efficiency; (3) a smaller model size to reduce network traffic and increase networking efficiency. Following these guidelines, we designed a new operator, the Temporal Shift Module (TSM), that is efficient and scalable for distributed training. The TSM model achieves 1.8× higher throughput than previous I3D models. We scale the training of the TSM model to 1,536 GPUs with a mini-batch of 12,288 video clips (98,304 images) without losing accuracy. With this hardware-aware model design, we scale up training on the Summit supercomputer and reduce the training time on the Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74.0%, which is 1.6× and 2.9× faster than previous 3D video models while reaching higher accuracy.
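
To make the key operator concrete, the sketch below shows the temporal shift at the heart of TSM in PyTorch: a fraction of the channels is shifted one step along the temporal dimension (part backward, part forward), so neighboring frames exchange information at zero FLOP cost. This is a minimal illustration only; the [N*T, C, H, W] tensor layout and the 1/8 shift ratio follow the open-source TSM code [22], but the function name and parameter names here are our own labels.

```python
import torch

def temporal_shift(x, n_segments, fold_div=8):
    # x: activations of shape [N*T, C, H, W], where each group of
    # n_segments consecutive entries along the batch axis is one clip.
    nt, c, h, w = x.size()
    n = nt // n_segments
    x = x.view(n, n_segments, c, h, w)

    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift 1/fold_div of channels backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another 1/fold_div forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place

    return out.view(nt, c, h, w)

# Example: a batch of 2 clips, 8 frames each, 64 channels, 56x56 feature maps.
x = torch.randn(2 * 8, 64, 56, 56)
y = temporal_shift(x, n_segments=8)
assert y.shape == x.shape  # pure data movement: no FLOPs, no new parameters
```

Because the shift is pure data movement, it adds temporal modeling to a 2D CNN backbone without the 3D-convolution FLOPs that bottleneck models like I3D, which is what makes guideline (1) attainable.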

[1] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, arXiv.

[2] Andrew Zisserman, et al. Two-Stream Convolutional Networks for Action Recognition in Videos, 2014, NIPS.

[3] Fei-Fei Li, et al. Large-Scale Video Classification with Convolutional Neural Networks, 2014, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Lorenzo Torresani, et al. Learning Spatiotemporal Features with 3D Convolutional Networks, 2015, IEEE International Conference on Computer Vision (ICCV).

[5] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[6] Samy Bengio, et al. Revisiting Distributed Synchronous SGD, 2016, arXiv.

[7] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.

[9] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.

[10] Yang You, et al. Large Batch Training of Convolutional Networks, 2017, arXiv:1708.03888.

[11] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, arXiv.

[12] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, arXiv.

[14] James Demmel, et al. ImageNet Training in Minutes, 2017, ICPP.

[15] Abhinav Gupta, et al. Non-local Neural Networks, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Quoc V. Le, et al. Don't Decay the Learning Rate, Increase the Batch Size, 2017, ICLR.

[17] Yuanzhou Yang, et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes, 2018, arXiv.

[18] Yutaka Satoh, et al. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.

[20] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, arXiv.

[21] Bronis R. de Supinski, et al. The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[22] Chuang Gan, et al. TSM: Temporal Shift Module for Efficient Video Understanding, 2019, IEEE/CVF International Conference on Computer Vision (ICCV).

[23] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.