Performance tuning and analysis of future vector processors based on the roofline model

The ratio of memory bandwidth to computational performance (B/F) of vector processors has dropped steeply in recent years, eroding their advantage over scalar processors in sustained performance. An on-chip vector cache is a promising way to compensate for the insufficient B/F ratio. Although the effectiveness of the vector cache has been evaluated, cache-conscious tuning of vector codes and analysis of the resulting performance have not yet been discussed. This paper therefore aims to establish a performance-tuning strategy that exploits the potential of a vector processor with a cache. To analyze sustained performance, the paper uses the roofline model. Several optimization techniques are applied to real scientific and engineering applications, and their effects are assessed with the model. We confirm that the model can guide users toward effective tuning that maximizes the gain from those optimizations. We also discuss the energy efficiency of the on-chip vector cache.
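For readers unfamiliar with the roofline model, the following minimal sketch shows its basic bound: attainable performance is the smaller of the peak computational rate and the memory bandwidth multiplied by the kernel's operational intensity (flops per byte). The function names and all numeric values are hypothetical, chosen only for illustration; they are not taken from the paper or from any specific machine.

```python
def machine_bf(peak_gflops, bandwidth_gbs):
    """Machine balance in bytes per flop (B/F)."""
    return bandwidth_gbs / peak_gflops


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(peak, bandwidth * operational intensity).

    intensity is the kernel's operational intensity in flops per byte.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machines with the same peak but different memory bandwidth.
PEAK = 100.0          # GFLOP/s (assumed value)
HIGH_BW = 250.0       # GB/s, i.e. 2.5 B/F (assumed value)
LOW_BW = 100.0        # GB/s, i.e. 1.0 B/F (assumed value)

# Hypothetical kernel performing 0.5 flop per byte moved from memory.
KERNEL_INTENSITY = 0.5

high_bf_perf = attainable_gflops(PEAK, HIGH_BW, KERNEL_INTENSITY)
low_bf_perf = attainable_gflops(PEAK, LOW_BW, KERNEL_INTENSITY)

print(f"B/F {machine_bf(PEAK, HIGH_BW):.1f}: bound {high_bf_perf:.1f} GFLOP/s")
print(f"B/F {machine_bf(PEAK, LOW_BW):.1f}: bound {low_bf_perf:.1f} GFLOP/s")
```

Under these assumed numbers, the same kernel is compute-bound on the higher-B/F machine and memory-bound on the lower-B/F one. This is the situation in which raising the effective operational intensity, for example by serving reused data from an on-chip cache, can lift the bound.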
