Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach

Abstract Explicit multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. XMT introduces a computational framework with (1) a simple programming style that relies on fine-grained PRAM-style algorithms, and (2) hardware support for low-overhead parallel threads, scalable load balancing, and efficient synchronization. The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. This paper takes the opportunity afforded by that compiler to evaluate the overall effectiveness of the interaction between the programming model and the hardware, and to enhance performance where needed by incorporating new optimizations into the XMT compiler. We present a wide range of applications that, written in XMT, obtain significant speedups relative to the best serial programs. We show that XMT is especially useful for advanced applications with dynamic, irregular access patterns, while for regular computations we demonstrate performance gains that scale to much higher levels than previously demonstrated for on-chip systems.
