Portable, modular expression of locality

It is difficult to achieve high performance while programming in the large. In particular, maintaining locality hinders portability and modularity. Existing methodologies are not sufficient: explicit communication and coding for locality require the programmer to violate encapsulation and compositionality of software modules, while automated compiler analysis remains unreliable. This thesis presents a performance model that makes thread and object locality explicit. Zones form a runtime hierarchy that reflects the intended clustering of threads and objects, which are dynamically mapped onto hardware units such as processor clusters, pages, or cache lines. This conceptual indirection allows programmers to reason about locality in the abstract without committing to the hardware of a specific memory system. Zones complement conventional coding for locality and may be added to existing code to improve performance without affecting correctness. The integration of zones into the Sather language is described, including an implementation of memory management customized to the parameters of the memory system.
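
The indirection that zones provide can be illustrated with a minimal sketch. It is written in Python rather than Sather, and every name in it (`Zone`, `map_zones`, the depth-based mapping policy, the example zone names) is a hypothetical illustration, not the thesis's interface or mapping algorithm. The program declares only how threads and objects cluster; a separate step decides which hardware unit each zone lands on.

```python
class Zone:
    """Hypothetical zone: a node in a runtime hierarchy of clustered threads/objects."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.members = []  # names of threads or objects placed in this zone
        if parent is not None:
            parent.children.append(self)

    def depth(self):
        """Number of ancestors between this zone and the root."""
        d, zone = 0, self
        while zone.parent is not None:
            d, zone = d + 1, zone.parent
        return d


def map_zones(root, levels):
    """Assign each zone a hardware unit by depth (an assumed, simplistic policy).

    `levels` lists hardware units from coarsest to finest. Zones deeper than
    the machine's hierarchy are clamped to the finest available unit.
    """
    mapping = {}
    stack = [root]
    while stack:
        zone = stack.pop()
        mapping[zone.name] = levels[min(zone.depth(), len(levels) - 1)]
        stack.extend(zone.children)
    return mapping


# The application states its intended clustering once...
grid = Zone("grid")
block = Zone("block", parent=grid)
row = Zone("row", parent=block)
row.members.append("cell[0..7]")

# ...and the same hierarchy is mapped onto two different memory systems.
numa = map_zones(grid, ["processor cluster", "page", "cache line"])
flat = map_zones(grid, ["page"])
```

Because the zone hierarchy only guides placement, replacing `levels` changes where data lives but not what the program computes, which is the sense in which zones can be added without affecting correctness.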
