Smartapps, an application centric approach to high performance computing: compiler-assisted software and hardware support for reduction operations

State-of-the-art run-time systems are a poor match for diverse, dynamic distributed applications because they are designed to support a wide variety of applications, with little customization to the requirements of any individual one. Little or no guiding information flows directly from the application to the run-time system to allow the latter to fully tailor its services to the application. As a result, performance is disappointing. To address this problem, we propose application-centric computing, or SMART APPLICATIONS. Into the executable of a smart application, the compiler embeds most run-time system services, together with a performance-optimizing feedback loop that monitors the application's performance and adaptively reconfigures both the application and the OS/hardware platform. At run-time, after incorporating the code's input and the system's resources and state, the SMARTAPP performs a global optimization. This optimization is instance specific and thus much more tractable than a generic global optimization across application, OS and hardware. The resulting code and resource customization should lead to major speedups. In this paper, we first describe the overall architecture of SMARTAPPS and then present some achievements to date, focusing on compiler-assisted software and hardware techniques for parallelizing reduction operations. These illustrate SMARTAPPS' use of adaptive algorithm selection and moderately reconfigurable hardware.
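As a rough illustration of adaptive algorithm selection for reductions, the Python sketch below chooses at run-time between a sequential reduction and a privatized parallel one, in which each worker accumulates into its own private copy and the copies are merged at the end. It is a minimal sketch under stated assumptions and not the SMARTAPPS implementation: the function names, the density heuristic, and its threshold are all hypothetical and do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_sequential(indices, values, size):
    """Baseline: accumulate values[k] into acc[indices[k]] one at a time."""
    acc = [0.0] * size
    for i, v in zip(indices, values):
        acc[i] += v
    return acc


def reduce_privatized(indices, values, size, workers=4):
    """Privatized reduction: each worker fills a private array, then merge."""
    n = len(indices)
    chunk = (n + workers - 1) // workers  # ceiling division; 0 when n == 0

    def partial(w):
        private = [0.0] * size  # private copy, so no locking is needed
        lo = w * chunk
        hi = min(n, lo + chunk)
        for k in range(lo, hi):
            private[indices[k]] += values[k]
        return private

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial, range(workers)))

    # Merge step: combine the private copies into the shared result.
    acc = [0.0] * size
    for p in partials:
        for j in range(size):
            acc[j] += p[j]
    return acc


def select_reduction(indices, size, workers=4, threshold=1.0):
    """Hypothetical heuristic (an assumption, not from the paper):
    privatize only when each private array receives enough updates
    to pay for allocating and merging it."""
    density = len(indices) / (size * workers)
    return reduce_privatized if density >= threshold else reduce_sequential


if __name__ == "__main__":
    idx = [0, 1, 0, 2, 1, 0]
    val = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    algorithm = select_reduction(idx, size=3)
    print(algorithm.__name__, algorithm(idx, val, 3))
```

The selection here depends only on the input instance (number of updates, array size, worker count), which mirrors the idea that a decision made for one specific run is easier than a generic one; a real system would presumably also use measured feedback.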
