FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

Protein structure prediction helps us understand gene translation and protein function, and is of growing interest and importance in structural biology. AlphaFold, which applies the Transformer architecture to achieve atomic-level accuracy in protein structure prediction, was a significant breakthrough. However, training and inference with AlphaFold are challenging due to its high computation and memory costs. In this work, we present FastFold, an efficient implementation of AlphaFold for both training and inference. We propose Dynamic Axial Parallelism and Duality Async Operations to improve the scaling efficiency of model parallelism. In addition, we propose AutoChunk, which reduces memory cost by over 80% during inference by automatically determining the chunking strategy. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves a 7.5-9.5x speedup for long-sequence inference. Furthermore, we scale FastFold to 512 GPUs and achieve an aggregate throughput of 6.02 PetaFLOP/s with 90.1% parallel efficiency.
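To make the chunking idea behind AutoChunk concrete, the sketch below shows how an activation-heavy module can be applied to slices of its input along one axis, so peak memory scales with the chunk size rather than the full sequence length. This is a minimal illustrative example in PyTorch, not the paper's AutoChunk implementation; the toy module, tensor shapes, and chunk size are hypothetical.

```python
# Minimal sketch of chunked execution (assumption: not the actual AutoChunk code).
# Splitting the input along one axis and processing it piece by piece bounds
# peak activation memory by the chunk size instead of the full sequence length.
import torch
import torch.nn as nn


def chunked_apply(module: nn.Module, x: torch.Tensor, chunk_size: int, dim: int = 0) -> torch.Tensor:
    """Apply `module` to `x` in slices of `chunk_size` along dimension `dim`."""
    outputs = []
    for chunk in torch.split(x, chunk_size, dim=dim):
        outputs.append(module(chunk))
    return torch.cat(outputs, dim=dim)


# Usage: a toy row-wise MLP over a pair representation of shape [L, L, C].
pair = torch.randn(256, 256, 64)
row_mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
with torch.no_grad():
    out = chunked_apply(row_mlp, pair, chunk_size=32, dim=0)  # 32 rows at a time
```

The trade-off is extra kernel launches and less fusion in exchange for a roughly `L / chunk_size` reduction in activation memory for that module; AutoChunk, as described in the abstract, automates the choice of where and how to chunk.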
