Architectural Support for Exploiting Fine Grain Parallelism

The advent of multi-core processors, together with projections that core counts will continue to increase, has focused attention on parallel programming. It is widely recognized that current programming techniques, including those used for scientific parallel programming, do not allow general-purpose applications to be formulated easily. One area receiving interest is the use of side-effect-free programming styles. Previous work on parallel functional programming demonstrated the potential of this approach to make parallelism easy to exploit. Recent systems such as Cilk use conventional languages such as C but encourage a largely functional (side-effect-free) style when writing programs. An important part of the Cilk runtime is a system that balances load across cores. In this paper we present SLAM (Spreading Load with Active Messages), a dynamic load-balancing system based on functional language evaluation techniques. We show that SLAM, provided with appropriate hardware support, significantly outperforms the Cilk system. We evaluated our system on tiled CMPs, considering private and shared L2 caches separately. For the benchmarks evaluated, SLAM outperforms Cilk by 28% on average on 32-core CMPs with private L2 caches. On CMPs with shared L2 caches, SLAM was on average 21% faster than Cilk with 32 cores and 62% faster with 64 cores.
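
To make the load-balancing problem concrete, the following is a minimal sketch of work stealing, the general scheduling idea used by Cilk-style runtimes. It is not SLAM's mechanism and not Cilk's actual implementation: it is a single-threaded, round-based simulation, and the function name `run_work_stealing`, its parameters, and the use of Fibonacci as the workload are all assumptions chosen for illustration. Because each subtask is side-effect free, any worker can execute it, which is what makes moving work between cores safe.

```python
from collections import deque
import random


def run_work_stealing(n, num_workers=4, seed=0):
    """Compute fib(n) by simulating work stealing in lock-step rounds.

    Hypothetical illustration only: real runtimes run workers concurrently
    and use carefully synchronized deques.
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    partial = [0] * num_workers  # per-worker partial sums, combined at the end
    steals = 0
    deques[0].append(n)  # all work starts on worker 0

    while any(deques):  # continue while any worker still holds a task
        for w in range(num_workers):
            if deques[w]:
                task = deques[w].pop()  # owner takes its newest task
            else:
                victim = rng.randrange(num_workers)
                if victim == w or not deques[victim]:
                    continue  # failed steal attempt; try again next round
                task = deques[victim].popleft()  # thief takes the oldest task
                steals += 1
            if task < 2:
                partial[w] += task  # leaf: fib(0) = 0, fib(1) = 1
            else:
                # "Spawn" two independent, side-effect-free subtasks.
                deques[w].append(task - 1)
                deques[w].append(task - 2)

    return sum(partial), steals


if __name__ == "__main__":
    result, steals = run_work_stealing(20)
    print(f"fib(20) = {result}, successful steals = {steals}")
```

In this sketch the owner works on its most recently created task while thieves take the oldest one, which tends to hand over larger subtrees per steal. The number of steals depends on the random victim choice, so only the computed result is deterministic across seeds.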
