Performance characteristics of the Cray X1 and their implications for application performance tuning

During the last decade, the scientific computing community has optimized many applications for execution on superscalar computing platforms. The recent arrival of the Japanese Earth Simulator has revived interest in vector architectures, especially in the US. It is therefore important to examine how to port current scientific applications to the new vector platforms and how to achieve high performance on them. The success of these porting efforts will also influence the acceptance of new vector architectures. In this paper, we first investigate the memory performance characteristics of the Cray X1, a recently released vector platform, and determine the most influential performance factors. Then, using these performance characteristics as guidelines, we examine how to optimize applications tuned on superscalar platforms for the Cray X1. Finally, we evaluate the different types of optimizations used, the effort required to implement them, and whether they provide any performance benefit when ported back to superscalar platforms.
