Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors: A Case Study of the Cell BE

Heterogeneous multi-core processors invest the largest portion of their transistor budget in customized "accelerator" cores, and use a small number of conventional low-end cores to supply computation to the accelerators. To maximize performance on these processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process that relies heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program that uses conventional processors and accelerators simultaneously. More specifically, the model reveals the optimal degrees of multi-dimensional (task-level and data-level) concurrency that maximize performance across cores. We use the model to derive mappings of two full computational phylogenetics applications onto a multi-processor based on the IBM Cell Broadband Engine.
