NAS Benchmarks on the Tera MTA
The Tera MTA is a new, revolutionary commercial computer based on a multithreaded processor architecture. We have compiled and run the five NAS kernel parallel benchmarks on a prototype version of the MTA. This paper briefly describes the MTA architecture, our experience with the compiler, and some performance results. We compare a single-processor MTA's performance and ease of programming to that of the Cray T90, the most powerful vector supercomputer made by Cray Research. We found that both the MTA and the single-processor T90 required no tuning on four of the five benchmarks to get respectable performance. The production MTA should be faster on the CG and IS benchmarks, and the T90 is faster on FT and MG. Except for MG, where the T90's faster clock and higher memory-to-processor bandwidth give it an unbeatable advantage, the differences in performance are relatively small. We have defined four levels of tuning effort, ranging from "no tuning" to "heroic". The one remaining code, EP, was easily modified to get vectorized or threaded execution. We report some further improvements that were obtained at higher levels of tuning effort on the MTA. In general, for these relatively simple benchmarks, we found tuning codes for the MTA significantly easier than on massively parallel multicomputers or even on high-performance workstations. In fact, most of the tuning consisted of removing unnecessary locality-enhancing "optimizations" that had been introduced into the NAS codes to improve their performance on computers with cache-based memories.