Model-guided autotuning of high-productivity languages for petascale computing

This paper addresses the enormous complexity of mapping applications to current and future highly parallel platforms, including scalable architectures consisting of tens of thousands of nodes, many-core devices with tens to hundreds of cores, and hierarchical systems providing multi-level parallelism. On systems of this scale, the performance of many important algorithms is dominated by the time required to move data across the levels of the memory hierarchy. As a consequence, locality awareness in algorithms and efficient management of communication are essential for obtaining scalable parallel performance, and are of particular concern for applications characterized by irregular memory access patterns. We describe the design of a programming system that focuses on the productivity of application programmers in expressing locality-aware algorithms for high-end architectures, which are then automatically tuned for performance. The approach combines two previously successful concepts for managing locality: high-level specification of user-defined data distributions and model-guided autotuning for data locality. The resulting combined system provides a powerful general mechanism for specifying data distributions, which can express domain-specific knowledge, and facilitates automatic tuning of a distribution both to the access patterns of the algorithms that use it and to the different levels of a memory hierarchy. Because there is a clean separation between the specification of a data distribution and the algorithms in which it is used, the two can be written separately and composed to quickly develop new applications that can be tuned in the context of their data set and execution environment. We address key issues for a range of codes including LU decomposition, sparse matrix-vector multiplication, and knowledge discovery. The knowledge discovery algorithms, in particular, stress the proposed language and compiler technology and provide a forcing function for developing tools that address the inherent challenges of irregular applications.
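The separation of concerns described above can be illustrated with a minimal sketch. The paper targets a Chapel-like high-productivity language, so the Python below is purely illustrative: the names `BlockDistribution`, `CyclicDistribution`, and `local_indices` are hypothetical stand-ins for the system's actual distribution interface, not its API. The point is only that the algorithm is written once against an abstract distribution, so a tuner can substitute distributions without modifying algorithm code.

```python
# Illustrative sketch (hypothetical names): a data distribution is
# specified independently of the algorithm that consumes it.

class BlockDistribution:
    """Maps a 1-D global index space onto num_nodes contiguous blocks."""
    def __init__(self, size, num_nodes):
        self.block = (size + num_nodes - 1) // num_nodes  # ceil(size / num_nodes)

    def owner(self, i):
        return i // self.block


class CyclicDistribution:
    """Maps a 1-D global index space onto num_nodes round-robin."""
    def __init__(self, size, num_nodes):
        self.num_nodes = num_nodes

    def owner(self, i):
        return i % self.num_nodes


def local_indices(dist, size, node):
    """Algorithm-side view: the indices owned by `node` under `dist`.

    Written once against the distribution interface; a tuner could swap
    in a different distribution without touching this code.
    """
    return [i for i in range(size) if dist.owner(i) == node]


if __name__ == "__main__":
    size, nodes = 16, 4
    for dist in (BlockDistribution(size, nodes), CyclicDistribution(size, nodes)):
        print(type(dist).__name__, local_indices(dist, size, node=1))
```

Running the sketch shows node 1 owning `[4, 5, 6, 7]` under the block distribution and `[1, 5, 9, 13]` under the cyclic one, while `local_indices` itself never changes; an autotuner choosing between the two for a given access pattern is the kind of composition the abstract describes.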
