Paper Information - TorchBench: Benchmarking PyTorch with High API Surface Coverage

TorchBench: Benchmarking PyTorch with High API Surface Coverage

Deep learning (DL) has been a revolutionary technique in various domains. To facilitate model development and deployment, many deep learning frameworks have been proposed, among which PyTorch is one of the most popular. The performance of the ecosystem around PyTorch is critically important: better performance lowers the cost of training models and reduces the response time of model inference. In this paper, we propose TorchBench, a novel benchmark suite for studying the performance of the PyTorch software stack. Unlike existing benchmark suites, TorchBench includes many representative models that cover a large PyTorch API surface. TorchBench can comprehensively characterize the performance of the PyTorch software stack, guiding performance optimization across models, the PyTorch framework, and GPU libraries. We show two practical use cases of TorchBench. (1) We profile TorchBench to identify GPU performance inefficiencies in PyTorch. We fix many performance bugs and upstream the patches to the official PyTorch repository. (2) We integrate TorchBench into the PyTorch continuous integration system. We identify performance regressions in multiple daily code check-ins, preventing performance bugs from entering the PyTorch repository. TorchBench is open source and keeps evolving.
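The second use case, catching performance regressions in continuous integration, can be illustrated with a minimal sketch. This is not TorchBench's actual implementation: the function names `median_latency` and `find_regressions`, the 5% threshold, and the benchmark names are all hypothetical, chosen only to show the general idea of comparing a new measurement against a stored baseline.

```python
import statistics
import time
from typing import Callable, Dict


def median_latency(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> float:
    """Return the median wall-clock latency of fn in seconds.

    Hypothetical helper. For GPU workloads, fn itself would need to wait for
    the device to finish (for example by calling torch.cuda.synchronize())
    so that the timer covers the full execution.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.05,  # assumed value, not taken from the paper
) -> Dict[str, float]:
    """Return {benchmark name: fractional slowdown} for every benchmark
    whose latency grew by more than `threshold` relative to the baseline."""
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base <= 0:
            continue  # skip benchmarks without a comparable measurement
        slowdown = (current[name] - base) / base
        if slowdown > threshold:
            regressions[name] = slowdown
    return regressions


if __name__ == "__main__":
    # Hypothetical latencies in seconds for two made-up benchmarks.
    baseline = {"model_a": 1.00, "model_b": 2.00}
    current = {"model_a": 1.20, "model_b": 2.02}
    print(find_regressions(baseline, current))
```

In a CI setting, a check like this would run on each check-in and fail the build when the returned dictionary is non-empty. Using the median of several timed runs is one common way to reduce the influence of measurement noise.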
