Performance portability on EARTH: a case study across several parallel architectures

Abstract

Due to the growing diversity of parallel architectures and the increasing development time of parallel applications, performance portability has become a major consideration in designing the next generation of parallel program execution models, APIs, and runtime system software. This paper analyzes both the code portability and the performance portability of parallel programs under fine-grained multithreaded execution and architecture models. We concentrate on one particular event-driven, fine-grained multithreaded execution model, EARTH, and discuss several design considerations of the EARTH model and its runtime system that contribute to the performance portability of parallel applications. We believe these are important issues for the design of future high-end computing system software. Four representative benchmarks were run on several different parallel architectures, including two clusters listed in the 23rd TOP500 supercomputer list. The results demonstrate that EARTH-based programs can achieve robust performance portability across the selected hardware platforms without any code modification or tuning.
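To make "event-driven, fine-grained multithreaded" concrete, the sketch below models fibers that run only after a synchronization counter reaches zero. This loosely follows the sync-slot idea used in EARTH. It is a single-node toy under stated assumptions. `SyncSlot`, `Scheduler`, and `reduction_demo` are hypothetical names and are not the EARTH runtime or Threaded-C API. No inter-node communication or hardware mapping is modeled.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple


@dataclass
class SyncSlot:
    """Hypothetical EARTH-style sync slot (not the real API).

    Counts outstanding sync signals and enables its fiber when the count hits zero.
    """
    count: int
    fiber: Callable[[], None]


@dataclass
class Scheduler:
    """Event-driven scheduler: a fiber is run only after it has been enabled."""
    ready: Deque[Callable[[], None]] = field(default_factory=deque)
    trace: List[str] = field(default_factory=list)

    def enable(self, fiber: Callable[[], None]) -> None:
        self.ready.append(fiber)

    def signal(self, slot: SyncSlot) -> None:
        # Each signal decrements the counter; the last one enables the fiber.
        slot.count -= 1
        if slot.count == 0:
            self.enable(slot.fiber)

    def run(self) -> int:
        # Fibers run to completion and never block, so a FIFO loop suffices.
        executed = 0
        while self.ready:
            fiber = self.ready.popleft()
            fiber()
            executed += 1
        return executed


def reduction_demo() -> Tuple[int, int, List[str]]:
    """Two producer fibers feed one consumer through a sync slot with count 2."""
    sched = Scheduler()
    partial = [0, 0]
    result = {}

    def consumer() -> None:
        result["total"] = partial[0] + partial[1]
        sched.trace.append("consumer")

    join = SyncSlot(count=2, fiber=consumer)

    def make_producer(i: int) -> Callable[[], None]:
        def producer() -> None:
            partial[i] = (i + 1) * 10  # stand-in for a unit of real work
            sched.trace.append(f"producer{i}")
            sched.signal(join)
        return producer

    sched.enable(make_producer(0))
    sched.enable(make_producer(1))
    executed = sched.run()
    return result["total"], executed, sched.trace
```

Calling `reduction_demo()` returns `(30, 3, ["producer0", "producer1", "consumer"])`. The consumer becomes runnable only when the second producer's signal drops the slot count to zero. The program expresses only the dependence structure and never says when or where each fiber runs. That separation is the property the abstract credits for running unmodified code across platforms, although this toy does not demonstrate it.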
