Paper information - Scaling application performance on a cache-coherent multiprocessor

Hardware-coherent, distributed shared address space systems are increasingly successful at moderate scale. However, it is unclear whether, or with how much difficulty, the performance of a load-store shared address space programming model scales to large processor counts on real applications. We examine this question using an aggressive case-study machine, the SGI Origin2000, up to 128 processors. We show for the first time that scalable performance can indeed be achieved in this programming model on a wide range of applications, including challenging kernels like FFT. However, this does not come easily, even for applications considered to be already highly optimized, and is very often not simply a matter of increasing problem size. Rather, substantial further application restructuring is often needed, which is usually quite algorithmic in nature. We examine how the restructurings compare with those needed for performance portability to shared virtual memory on clusters, and we comment on common programming guidelines for performance portability and scalability as well as on how the programming difficulty compares with that of explicit message passing. We also examine where applications spend their time on this large machine, the impact of special hardware features that the machine provides, and the impact of mapping to the network topology.

Jaswinder Pal Singh | Dongming Jiang
