POSTER: Easy PRAM-based High-Performance Parallel Programming with ICE

Parallel machines have become more widely used. Unfortunately, parallel programming technologies have advanced at a much slower pace, except for regular programs. For irregular programs, this advancement is inhibited by high synchronization costs, non-loop parallelism, non-array data structures, recursively expressed parallelism, and parallelism that is too fine-grained to be exploitable. We present ICE, a new parallel programming language that is easy to program in, because: (i) ICE is a synchronous, lock-step language, so there is no need for programmer-specified synchronization; (ii) writing the ICE program for a PRAM algorithm amounts to directly transcribing the algorithm; and (iii) the PRAM algorithmic theory offers a unique wealth of parallel algorithms and techniques. We propose ICE as part of an ecosystem consisting of the XMT architecture, the PRAM algorithmic model, and ICE itself, which together deliver on the twin goals of easy programming and efficient parallelization of irregular programs. The XMT architecture, developed at UMD, can exploit fine-grained parallelism in irregular programs. We have built the ICE compiler, which translates the ICE language into the multithreaded XMTC language; this is significant because multithreading is a feature shared by practically all current scalable parallel programming languages, thus providing a method for compiling ICE code. As one indication of ease of programming, we observed a reduction in code size in 11 out of 16 benchmarks compared to hand-optimized XMTC. For these programs, the average reduction in lines of code was 35.5 percent. The remaining 5 benchmarks had almost the same code size for both ICE and hand-optimized XMTC. Our main result is perhaps surprising: the run time was comparable to XMTC, with a 0.53 percent average gain for ICE across all benchmarks.
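
To make the lock-step idea concrete, the following is a minimal sketch, written in plain C rather than ICE or XMTC (whose syntax is not shown here), of a classic PRAM technique: pointer jumping for list ranking. The double-buffered arrays and the copy at the end of each round emulate the implicit barrier that a synchronous, lock-step language such as ICE provides after every parallel step; the names, data, and sequential simulation are assumptions for illustration, not the authors' implementation.

#include <stdio.h>
#include <string.h>

#define N 8  /* nodes 0..6 form a chain ending at node 7, the tail */

int main(void) {
    int succ[N] = {1, 2, 3, 4, 5, 6, 7, 7};  /* successor pointers; the tail points to itself */
    int rank[N] = {1, 1, 1, 1, 1, 1, 1, 0};  /* distance covered toward the tail so far */
    int next_succ[N], next_rank[N];

    /* ceil(log2 N) synchronous rounds; each round is one PRAM lock-step. */
    for (int round = 0; (1 << round) < N; round++) {
        /* Conceptually, all i execute this body in parallel, reading only
           the old succ/rank values from before the step. */
        for (int i = 0; i < N; i++) {
            next_rank[i] = rank[i] + rank[succ[i]];
            next_succ[i] = succ[succ[i]];
        }
        /* The copy-back stands in for the implicit end-of-step barrier
           that a lock-step language would supply automatically. */
        memcpy(succ, next_succ, sizeof succ);
        memcpy(rank, next_rank, sizeof rank);
    }

    for (int i = 0; i < N; i++)
        printf("node %d: distance to tail = %d\n", i, rank[i]);
    return 0;
}

In a lock-step language the explicit next_* buffers and the copy-back would disappear: the semantics guarantee that every read within a parallel step sees the values from before the step, which is why a PRAM algorithm can be transcribed essentially as written.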
