Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Building large AI fleets to support rapidly growing DL workloads is an active research topic for modern cloud providers. Generating accurate benchmarks plays an essential role in designing software and hardware solutions in this fast-paced space. Two fundamental challenges in making benchmark generation scalable are (i) workload representativeness and (ii) the ability to quickly incorporate changes to the fleet into the benchmarks. To overcome these issues, we propose Mystique, an accurate and scalable framework for production AI benchmark generation. It leverages the PyTorch execution trace (ET), a new feature that captures the runtime information of AI models at the granularity of operators, in a graph format, together with their metadata. By sourcing fleet ETs, we can build AI benchmarks that are portable and representative. Mystique is scalable because its data collection is lightweight in both runtime overhead and instrumentation effort. It is also adaptive because ET composability allows flexible control over benchmark creation. We evaluate our methodology on several production AI models and show that benchmarks generated with Mystique closely resemble the original AI models, both in execution time and in system-level metrics. We also showcase the portability of the generated benchmarks across platforms and demonstrate several use cases enabled by the fine-grained composability of the execution trace.
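As a rough illustration of what an operator-level trace in graph form makes possible, the following minimal sketch replays a toy trace in dependency order and composes a reduced benchmark by filtering operators. It is not Mystique's implementation and does not use the real PyTorch ET schema: the `TraceNode` fields, the operator names, and the helpers `replay_order` and `select_subgraph` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class TraceNode:
    """One operator record in a toy trace (hypothetical schema, not the PyTorch ET format)."""
    node_id: int
    op_name: str
    input_shapes: list                        # metadata recorded alongside the operator
    deps: list = field(default_factory=list)  # ids of operators that must run first


def replay_order(nodes):
    """Return operator names in an order that respects the recorded dependencies."""
    by_id = {n.node_id: n for n in nodes}
    sorter = TopologicalSorter({n.node_id: set(n.deps) for n in nodes})
    return [by_id[i].op_name for i in sorter.static_order()]


def select_subgraph(nodes, keep):
    """Keep operators whose name satisfies `keep`; rewire edges around removed ones."""
    by_id = {n.node_id: n for n in nodes}
    kept_ids = {n.node_id for n in nodes if keep(n.op_name)}

    def resolve(dep):
        # Walk back through removed operators until a kept ancestor is found.
        if dep in kept_ids:
            return [dep]
        return [r for d in by_id[dep].deps for r in resolve(d)]

    return [
        TraceNode(n.node_id, n.op_name, n.input_shapes,
                  [r for d in n.deps for r in resolve(d)])
        for n in nodes if n.node_id in kept_ids
    ]


# A toy four-operator trace; names and shapes are made up for illustration.
trace = [
    TraceNode(1, "embedding_lookup", [[1024, 64]]),
    TraceNode(2, "linear", [[1024, 64], [64, 32]], deps=[1]),
    TraceNode(3, "all_reduce", [[1024, 32]], deps=[2]),
    TraceNode(4, "relu", [[1024, 32]], deps=[3]),
]

full = replay_order(trace)
compute_only = replay_order(select_subgraph(trace, lambda name: name != "all_reduce"))
```

Here `full` lists all four operators in dependency order, while `compute_only` omits the communication operator but keeps `relu` ordered after `linear`. A real replay would re-execute each operator with tensors built from the recorded metadata rather than just listing names.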
