Scale-Out vs Scale-Up

The arrival of 64-bit ARM processing has generated enthusiasm for developing ARM-based servers targeted at both data centers and supercomputers. In addition to server-class components and hardware advancements, the ARM software environment has grown substantially over the past decade. Major development ecosystems and libraries have been ported to and optimized for ARM, making it suitable for server-class workloads. Available ARM SoCs follow two trends: mobile-class SoCs, which rely on the heterogeneous integration of CPU cores, GPGPU streaming multiprocessors (SMs), and other accelerators; and server-class SoCs, which instead integrate a larger number of CPU cores with no GPGPU support, along with a number of I/O accelerators. The two classes also scale the number of processing cores under different paradigms: mobile-class SoCs use a scale-out architecture, in the form of a cluster of simpler systems connected over a network, whereas server-class ARM SoCs use a scale-up solution and leverage symmetric multiprocessing to pack a large number of cores on a chip. In this article, we present the ScaleSoC cluster, a scale-out solution based on mobile-class ARM SoCs. ScaleSoC leverages fast network connectivity and GPGPU acceleration to improve performance and energy efficiency compared to previous ARM scale-out clusters. To study both scaling paradigms, we consider a wide range of modern server-class parallel workloads, including latency-sensitive transactional workloads, MPI-based CPU and GPGPU-accelerated scientific applications, and emerging artificial intelligence workloads. We study in depth the performance and energy efficiency of ScaleSoC compared to server-class ARM SoCs and discrete GPGPUs.
We quantify the impact of network overhead on the performance of ScaleSoC and show that packing a large number of ARM cores on a single chip does not necessarily guarantee better performance, because shared resources, such as the last-level cache, become performance bottlenecks. We characterize the GPGPU-accelerated workloads and demonstrate that, for applications that can leverage the better CPU-GPGPU balance of the ScaleSoC cluster, performance and energy efficiency improve compared to discrete GPGPUs.
