HPC AI500 V2.0: The Methodology, Tools, and Metrics for Benchmarking HPC AI Systems

Recent years have witnessed a trend of applying large-scale distributed deep learning (HPC AI) in both business and scientific computing, with the goal of reducing training time while reaching state-of-the-art quality. HPC AI benchmarks accelerate this process. Unfortunately, benchmarking HPC AI systems at scale raises serious challenges. This paper presents a comprehensive HPC AI benchmarking methodology that achieves equivalence, representativeness, repeatability, and affordability. From the nineteen AI workloads of AIBench Training, by far the most comprehensive AI benchmark suite, we choose two representative and repeatable AI workloads in terms of both AI model and micro-architectural characteristics. The selected HPC AI benchmarks cover both business and scientific computing: Image Classification and Extreme Weather Analytics. Finally, we propose three benchmarking levels and the corresponding rules to assure equivalence. To rank the performance of HPC AI systems, we present a new metric named Valid FLOPS, which emphasizes both throughput and target quality. The evaluations show that our methodology, benchmarks, and metrics can measure and rank HPC AI systems in a simple, affordable, and repeatable way. The specification, source code, datasets, and HPC AI500 ranking numbers are publicly available at https://www.benchcouncil.org/aibench/hpcai500/index.html.
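
The abstract does not spell out how Valid FLOPS combines raw throughput with the target quality. The sketch below illustrates one plausible reading, in which measured FLOPS are scaled by a quality penalty so that a fast run that misses the quality target ranks below a slower run that reaches it. The function name, the penalty form, and the exponent value are illustrative assumptions, not the paper's definitive formulation.

```python
# Hypothetical sketch of a quality-penalized throughput metric in the spirit of
# Valid FLOPS. The ratio-based penalty and its exponent are assumptions made for
# illustration only; consult the HPC AI500 specification for the actual definition.

def valid_flops(measured_flops: float,
                achieved_quality: float,
                target_quality: float,
                penalty_exponent: int = 5) -> float:
    """Scale raw throughput by a quality penalty: runs that fall short of the
    target quality lose a disproportionate share of their measured FLOPS."""
    penalty = (achieved_quality / target_quality) ** penalty_exponent
    return measured_flops * penalty

# Example: reaching 76.0% top-1 accuracy against a 75.9% target keeps almost all
# measured FLOPS; stopping at 74.0% is penalized noticeably despite higher raw FLOPS.
print(valid_flops(1.2e18, 0.760, 0.759))  # close to 1.2e18
print(valid_flops(1.5e18, 0.740, 0.759))  # visibly below 1.5e18
```

A steep exponent is what makes the metric "valid": it keeps pure-throughput optimizations (e.g., extreme batch sizes or reduced precision that hurt convergence) from winning the ranking unless the target quality is actually met.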
