A Comparison of ARM Against x86 for Distributed Machine Learning Workloads

The rise of Machine Learning (ML) in the last decade has created an unprecedented surge in demand for new and more powerful hardware. Various hardware approaches exist to meet this demand, motivating the need for performance benchmarks that compare these diverse systems. In this paper, we present a comprehensive analysis and comparison of available benchmark suites in ML and related fields. We use this analysis to discuss the potential of ARM processors in ML deployments. The paper concludes with a brief hardware performance comparison of modern, server-grade ARM and x86 processors using a benchmark suite selected from our survey.
