Paper information - Working sets, cache sizes, and node granularity issues for large-scale multiprocessors

The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors. We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.

[1] Gordon Moore. Solid State: VLSI: Some fundamental challenges: Defining and designing the products made possible by very-large-scale integration are first on the list of priority tasks, 1979, IEEE Spectrum.

[2] H. T. Kung. Memory requirements for balanced computer architectures, 1986, ISCA '86.

[3] L. Hernquist. Hierarchical N-body methods, 1987.

[4] Mark A. Johnson, et al. Solving problems on concurrent processors. Vol. 1: General techniques and regular problems, 1988.

[5] G. C. Fox, et al. Solving Problems on Concurrent Processors, 1988.

[6] David H. Bailey, et al. FFTs in external or hierarchical memory, 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[7] C. Van Loan. Computational Frameworks for the Fast Fourier Transform, 1992.

[8] John K. Salmon, et al. Parallel hierarchical N-body methods, 1992.

[9] Marc Levoy, et al. Volume rendering on scalable shared-memory MIMD architectures, 1992, VVS.

[10] Anoop Gupta, et al. Scaling parallel programs for multiprocessors: methodology and examples, 1993, Computer.