Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
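
ParaDnn's actual interface is not shown in this abstract, so the following is only a minimal sketch of what "parameterized" means here, assuming a Keras-style API: a generator function builds end-to-end fully connected models across a swept hyperparameter space instead of fixing a single workload. The function name make_fc_model, its parameter names, and the swept values are illustrative assumptions, not ParaDnn's real code.

    # Hypothetical sketch of a ParaDnn-style parameterized FC generator.
    # Names and swept values are illustrative, not ParaDnn's actual API.
    import itertools
    import tensorflow as tf

    def make_fc_model(num_layers, nodes_per_layer, input_size, output_size):
        # Build an end-to-end fully connected model of the requested shape.
        model = tf.keras.Sequential()
        model.add(tf.keras.Input(shape=(input_size,)))
        for _ in range(num_layers):
            model.add(tf.keras.layers.Dense(nodes_per_layer, activation="relu"))
        model.add(tf.keras.layers.Dense(output_size, activation="softmax"))
        model.compile(optimizer="sgd", loss="categorical_crossentropy")
        return model

    # Sweeping the parameter space yields a family of benchmark models
    # rather than a fixed set of workloads.
    for layers, nodes in itertools.product([4, 8, 16], [512, 2048, 8192]):
        model = make_fc_model(layers, nodes, input_size=2000, output_size=1000)

Analogous sweeps, presumably over filter counts and block depths for the CNN family and over embedding and cell sizes for the RNN family, would complete the suite; the point of the design is that model size and shape become independent variables of the benchmark rather than properties of a handful of fixed networks.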
