Techniques for Improving the Performance of Parallel Computations

Developing parallel implementations of applications which utilise an acceptably large fraction of the peak performance of current high performance computers has proved a difficult task. The lack of success in this endeavour is perceived as a major impediment to the general acceptance of high performance computing in industry. Even for structured, 'static', scientific and engineering applications coded in FORTRAN, where performance is apparently predictable, success has been limited. This thesis argues that, while the development of high performance applications for parallel systems remains an experimental task suitable only for the expert programmer, systematic techniques, which maximise the benefit of programmer effort, can be employed in order to develop 'good' parallel implementations rapidly. A framework for such a method is presented, and a set of supporting techniques is developed, by means of a series of examples on the Kendall Square Research KSR1. The method requires the achieved performance of an implementation to be described in terms of an 'ideal' parallel performance, plus a small number of (parallel) overhead terms. Once the magnitude of each overhead term has been quantified, a systematic, iterative process of overhead minimisation can take place. The source of each targeted overhead is analysed, and an alternative implementation, which reduces the overhead, is developed. Analysis of the overheads requires a mixture of experiment and modelling.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institution of learning.

Copyright

Copyright in text of this thesis rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author and lodged in the John Rylands University Library of Manchester. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author.

The ownership of any intellectual property rights which may be described in this thesis is vested in the University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement. Further information on the conditions under which disclosures and exploitation may take place is available from the head of Department of Computer Science.

Education and Research

The author graduated from the University of Manchester in 1978 with a BSc(Hons) in Physics. After obtaining a PGCE from Christ College Liverpool, and completing his probationary teaching year, he moved into the area of real-time systems simulation in industry, first with Redifusion Flight Simulation in Crawley, then with Ferranti Computer Systems in Cheadle, Manchester, before joining the Centre for Novel Computing in the Department of Computer Science at the University of Manchester in 1990.

Acknowledgements

I would like to thank the following: CNC people past and present for the stimulating tea breaks, in particular Mark, Rupert, Rob and Andy. Also John, for both the opportunity and his patience; Mum and Dad for all their support and encouragement through the years; and finally Pat, for making it all worthwhile.
