Parallel Programming With Hierarchically Tiled Arrays

Writing high-performance programs is a non-trivial task and remains a challenge even for advanced programmers. This dissertation describes a new data type, the Hierarchically Tiled Array (HTA), that simplifies this task. HTAs are tiled arrays whose elements can be HTAs, arrays, or scalars. The elements can be distributed across a cluster of computers or collocated on a single processor. They can be accessed and operated on like the scalar elements of conventional n-dimensional arrays. They can also be assigned to one another or passed as arguments to a function. In essence, the HTA is an attempt to adopt tiles as first-class data types and to allow their direct manipulation. Augmenting existing programming languages with HTAs offers several benefits to developers of high-performance programs. HTAs provide a global shared-memory abstraction, which significantly reduces the time needed to develop parallel programs. The control flow of a parallel HTA program resembles that of a sequential program and is therefore easy to reason about. HTAs naturally facilitate the development of recursive blocked algorithms aimed at exploiting deep memory hierarchies. The rich set of well-defined operations and vector-style expressions leads to code that is clearer and shorter. Since HTAs are also conventional arrays, adding them to a language places no extra burden on programmers. Moreover, the performance benefits of tiling are preserved. To support these claims, two popular languages, C++ and MATLAB, have been extended with HTAs. In addition, the NAS benchmark suite, a set of complex, computation-intensive parallel programs, has been rewritten using HTAs. We compare the lines of code and execution times of the HTA programs with those of the FORTRAN versions. Our results show
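The idea of tiles as first-class values can be illustrated with a minimal sequential sketch. It is not the C++ or MATLAB HTA interface described in the dissertation: the class `TiledArray` and its methods `partition`, `refine`, `tile`, `element`, and `map_leaves` are hypothetical names chosen for illustration, and the sketch assumes two-dimensional data stored as Python lists, with no distribution across processors.

```python
# Toy illustration only: all names here are hypothetical, not the HTA API.

def _shape(tile):
    """Return (rows, cols) of a leaf tile (list of lists) or a TiledArray."""
    if isinstance(tile, TiledArray):
        return tile.shape
    return (len(tile), len(tile[0]))


def _split(matrix, tile_rows, tile_cols):
    """Cut a list-of-lists matrix into a grid of tile_rows x tile_cols blocks."""
    grid = []
    for r in range(0, len(matrix), tile_rows):
        grid.append([
            [row[c:c + tile_cols] for row in matrix[r:r + tile_rows]]
            for c in range(0, len(matrix[0]), tile_cols)
        ])
    return grid


class TiledArray:
    """A 2-D grid of tiles; each tile is a leaf matrix or another TiledArray."""

    def __init__(self, tiles):
        self.tiles = tiles

    @classmethod
    def partition(cls, matrix, tile_rows, tile_cols):
        """Build a one-level tiled array from a plain matrix."""
        return cls(_split(matrix, tile_rows, tile_cols))

    @property
    def shape(self):
        """Overall (rows, cols), summed over the first tile column and row."""
        rows = sum(_shape(row[0])[0] for row in self.tiles)
        cols = sum(_shape(t)[1] for t in self.tiles[0])
        return (rows, cols)

    def tile(self, i, j):
        """Tile-level access: return the tile at grid position (i, j)."""
        return self.tiles[i][j]

    def element(self, r, c):
        """Flat access: treat the whole hierarchy as one conventional array."""
        i = 0
        while r >= _shape(self.tiles[i][0])[0]:
            r -= _shape(self.tiles[i][0])[0]
            i += 1
        j = 0
        while c >= _shape(self.tiles[0][j])[1]:
            c -= _shape(self.tiles[0][j])[1]
            j += 1
        t = self.tiles[i][j]
        return t.element(r, c) if isinstance(t, TiledArray) else t[r][c]

    def refine(self, tile_rows, tile_cols):
        """Add one more level of tiling by partitioning every leaf tile."""
        return TiledArray([
            [t.refine(tile_rows, tile_cols) if isinstance(t, TiledArray)
             else TiledArray.partition(t, tile_rows, tile_cols)
             for t in row]
            for row in self.tiles
        ])

    def map_leaves(self, fn):
        """Apply fn to every leaf tile, preserving the tiling structure."""
        return TiledArray([
            [t.map_leaves(fn) if isinstance(t, TiledArray) else fn(t)
             for t in row]
            for row in self.tiles
        ])


# A 4x4 matrix holding 0..15, tiled into four 2x2 tiles.
matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
hta = TiledArray.partition(matrix, 2, 2)

# A second level of tiling: each 2x2 tile becomes two 1x2 tiles.
two_level = hta.refine(1, 2)

# A tile-wise operation written once and applied to every leaf.
doubled = hta.map_leaves(lambda t: [[2 * x for x in row] for row in t])
```

In this sketch, `hta.tile(0, 1)` returns a whole tile, while `hta.element(2, 3)` addresses the same data as a conventional array would, regardless of how many levels of tiling have been added with `refine`.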
