Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization (i.e., the percentage of per-batch training time during which kernels are running on the device) is low compared to other well-optimized computer vision (CV) and natural language processing (NLP) models. We show that both device active time (the sum of kernel runtimes) and device idle time are important components of the overall device time, and that they can be tackled separately: (1) by flexibly adopting heuristic- and ML-based kernel performance models for the kernels that dominate device active time, and (2) by categorizing operator overheads into five types to quantitatively determine their contribution to the overall device time. Combining these two parts, we propose a critical-path-based algorithm that predicts the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean absolute error (GMAE) in all kernel performance modeling, and 5.23% and 7.96% geomean errors for GPU active time and overall end-to-end per-batch training time prediction, respectively, on highly customized, multi-factor-dominated DLRM architectures. We also demonstrate that our performance model generalizes to the compute-bound DL models targeted by most previous methods and better assists general model-system co-design than prior work.
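The critical-path idea above can be sketched as a longest-path traversal over an operator dependency DAG, where each node's duration stands in for a per-kernel prediction from a heuristic- or ML-based performance model. The sketch below is a minimal illustration, not the paper's implementation: the operator names, dependency structure, and the symmetric GMAE formulation (geometric mean of |ln(predicted/actual)|, expressed as a fraction) are all assumptions for demonstration.

```python
import math

def critical_path_time(durations, deps):
    """Longest-path (critical-path) time through an operator DAG.

    durations: {op: predicted_time_ms} from per-kernel performance models
    deps:      {op: [ops that must finish before op can start]}
    """
    finish = {}  # memoized finish time per operator

    def finish_time(op):
        if op not in finish:
            # An op starts once all of its dependencies have finished.
            start = max((finish_time(d) for d in deps.get(op, [])), default=0.0)
            finish[op] = start + durations[op]
        return finish[op]

    # Predicted per-batch device time = latest finish over all operators.
    return max(finish_time(op) for op in durations)

def gmae(predicted, actual):
    """One common geometric-mean-absolute-error formulation (as a fraction):
    exp(mean(|ln(p/a)|)) - 1, symmetric in over- and under-prediction."""
    logs = [abs(math.log(p / a)) for p, a in zip(predicted, actual)]
    return math.exp(sum(logs) / len(logs)) - 1.0
```

For example, with a toy DLRM-like graph where embedding lookup and MLP run independently before feature interaction, the predicted batch time is driven by the slower of the two parallel branches plus the serial tail:

```python
durations = {"emb": 2.0, "mlp": 3.0, "interact": 1.0, "loss": 0.5}
deps = {"interact": ["emb", "mlp"], "loss": ["interact"]}
critical_path_time(durations, deps)  # 3.0 + 1.0 + 0.5 = 4.5 ms
```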
