MultiMLton: A multicore-aware runtime for Standard ML

MultiMLton is an extension of the MLton compiler and runtime system that targets scalable multicore architectures. It provides specific support for ACML, a derivative of Concurrent ML that allows the construction of composable asynchronous events. To manage asynchrony effectively, the runtime must efficiently handle potentially large numbers of lightweight, short-lived threads, many of which are created specifically to deal with the implicit concurrency introduced by asynchronous events. Scalability demands also dictate that the runtime minimize global coordination. MultiMLton therefore implements a split-heap memory manager that allows mutators and collectors running on different cores to operate mostly independently. More significantly, MultiMLton exploits the premise that ACML programs expose a surfeit of available concurrency to realize a new collector design that completely eliminates the need for read barriers, a source of significant overhead in other managed runtimes. These two symbiotic features, a thread design specifically tailored to support asynchronous communication and a memory manager that exploits lightweight concurrency to greatly reduce barrier overheads, are MultiMLton's key novelties. In this article, we describe the rationale, design, and implementation of these features, and provide experimental results, over a range of parallel benchmarks and multicore architectures including an 864-core Azul Vega 3 and a 48-core non-cache-coherent Intel SCC (Single-chip Cloud Computer), that justify our design decisions.
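As a concrete illustration of the programming model this runtime is built to serve, the sketch below uses MLton's standard CML interface (channels, spawn, first-class events combined with wrap and choose, and sync). It is a minimal sketch, not code from the article: the asynchronous names at the end (aSendEvt, aSync) follow the ACML paper's terminology and appear only in a comment, as an assumption about that API rather than its definitive form.

(* A minimal sketch of the CML-style programming model that MultiMLton
 * targets, written against MLton's CML library (include
 * $(SML_LIB)/cml/cml.mlb in the .mlb file).  The synchronous part is
 * standard CML; the asynchronous names in the final comment are
 * illustrative ACML terminology only. *)
structure Sketch =
struct
  open CML

  (* A server thread that replies to each request with its successor. *)
  fun server (req : int chan, rep : int chan) : thread_id =
    spawn (fn () =>
      let fun loop () =
            let val n = recv req          (* block until a request arrives *)
            in  send (rep, n + 1)         (* reply synchronously *)
              ; loop ()
            end
      in loop () end)

  (* First-class events compose before they are synchronized on:
   * wrap attaches a post-synchronization action, choose selects
   * among alternatives. *)
  fun replyOrTimeout (rep : int chan) : string event =
    choose [ wrap (recvEvt rep, fn n => "got " ^ Int.toString n)
           , wrap (timeOutEvt (Time.fromSeconds 1), fn () => "timeout") ]

  fun client () : string =
    let
      val req = channel () and rep = channel ()
      val _   = server (req, rep)
    in
      send (req, 41)
    ; sync (replyOrTimeout rep)           (* commit to exactly one alternative *)
    end

  (* ACML (illustrative, assumed API): an asynchronous send event splits
   * synchronization from communication; aSync returns as soon as the send
   * is posted, and an implicitly created lightweight thread completes the
   * match later.  It is exactly these short-lived threads that the
   * MultiMLton runtime is designed to make cheap.
   *
   *   val _ = aSync (aSendEvt (req, 42))
   *)

  (* Entry point: under MLton's CML, the scheduler is started with RunCML.doit. *)
  val _ = RunCML.doit (fn () => ignore (client ()), NONE)
end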
