HMPI: towards a message-passing library for heterogeneous networks of computers

The paper presents Heterogeneous MPI (HMPI), an extension of MPI for programming high-performance computations on heterogeneous networks of computers. It allows the application programmer to describe the performance model of the implemented algorithm. This model accounts for all the main features of the underlying parallel algorithm that affect its execution performance: the total number of parallel processes, the total volume of computations to be performed by each process, the total volume of data to be transferred between each pair of processes, and how exactly the processes interact during the execution of the algorithm. Given the description of the performance model, HMPI creates a group of processes that executes the algorithm faster than any other group of processes. The principal extensions to MPI are presented. Parallel simulation of the interaction of electric and magnetic fields and parallel matrix multiplication are used to demonstrate the features of the library.
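The idea of selecting a process group from a performance model can be illustrated with a minimal sketch. It is not HMPI's API or its actual mapping algorithm: the function names (`estimate_time`, `select_group`), the cost formula (computation volume divided by processor speed, plus data volume divided by link bandwidth, with the slowest process as the bottleneck), and the exhaustive search are all assumptions made for illustration only.

```python
from itertools import permutations


def estimate_time(mapping, comp_volume, comm_volume, speed, bandwidth):
    """Estimate the run time of one mapping (hypothetical cost model).

    mapping[i] is the index of the processor hosting abstract process i.
    The estimate is the time of the slowest process, where each process
    pays for its own computation plus the data it exchanges with peers.
    """
    n = len(mapping)
    worst = 0.0
    for i in range(n):
        compute = comp_volume[i] / speed[mapping[i]]
        transfer = sum(
            comm_volume[i][j] / bandwidth[mapping[i]][mapping[j]]
            for j in range(n)
            if j != i and comm_volume[i][j] > 0
        )
        worst = max(worst, compute + transfer)
    return worst


def select_group(comp_volume, comm_volume, speed, bandwidth):
    """Return the mapping with the smallest estimated time.

    Brute force over all assignments of processes to distinct processors;
    only practical for very small examples.
    """
    n = len(comp_volume)
    candidates = permutations(range(len(speed)), n)
    return min(
        candidates,
        key=lambda m: estimate_time(m, comp_volume, comm_volume, speed, bandwidth),
    )


if __name__ == "__main__":
    # Two abstract processes, three processors of different speeds.
    comp_volume = [100, 30]            # computation units per process
    comm_volume = [[0, 4], [4, 0]]     # data units exchanged per pair
    speed = [1, 10, 5]                 # computation units per second
    bandwidth = [[2.0] * 3 for _ in range(3)]  # data units per second
    best = select_group(comp_volume, comm_volume, speed, bandwidth)
    print(best, estimate_time(best, comp_volume, comm_volume, speed, bandwidth))
```

In this toy setting the heavier process lands on the fastest processor and the lighter one on the next fastest. A real library would need a search that scales beyond exhaustive enumeration and a model of how processes interact over time, which this sketch omits.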
