Compute bottlenecks on the new 64-bit ARM

The trifecta of power, performance, and programmability has spurred significant interest in the 64-bit ARMv8 platform. These new systems offer energy efficiency, a traditional CPU programming model, and the potential for high performance when enough cores are applied to the problem. However, it remains unclear how well the ARM architecture will serve as a design point for the High Performance Computing market. In this paper, we characterize and investigate the key architectural factors that impact power and performance on a current ARMv8 offering (X-Gene 1) and on Intel's Sandy Bridge processor. Using Principal Component Analysis, multiple linear regression models, and variable importance analysis, we conclude that the CPU frontend has the biggest impact on performance on both the X-Gene and Sandy Bridge processors.
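As a rough illustration of the regression and variable importance steps named above, the following sketch fits a multiple linear regression on standardized predictors and ranks them by the magnitude of their coefficients. It is not the paper's pipeline: the counter names, the synthetic data, and the use of standardized coefficients as the importance measure are all assumptions made for this example, and the Principal Component Analysis step is omitted.

```python
import random
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def standardized_coefficients(columns, y):
    """Least-squares fit of standardized y on standardized predictors.

    Solves the normal equations (Z^T Z) beta = Z^T y. No intercept is
    needed because every variable has zero mean after standardization.
    """
    Z = [standardize(c) for c in columns]
    yz = standardize(y)
    k = len(Z)
    gram = [[sum(a * b for a, b in zip(Z[i], Z[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(a * b for a, b in zip(Z[i], yz)) for i in range(k)]
    return solve(gram, rhs)


# Hypothetical counter names and synthetic data, for illustration only.
names = ["frontend_stalls", "l2_misses", "branch_mispredicts"]
rng = random.Random(0)
columns = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in names]
# Synthetic response constructed so the first predictor dominates.
y = [3.0 * a + 0.5 * b + 0.1 * c + rng.gauss(0.0, 0.1)
     for a, b, c in zip(*columns)]

beta = standardized_coefficients(columns, y)
ranking = sorted(names, key=lambda n: abs(beta[names.index(n)]), reverse=True)
print(ranking)
```

Because the predictors are standardized first, the coefficient magnitudes are comparable across counters measured in different units, which is what makes them usable as a simple importance ranking.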
