Composing parallel software efficiently with Lithe

Applications composed of multiple parallel libraries perform poorly when those libraries interfere with one another by obliviously using the same physical cores, leading to destructive resource oversubscription. This paper presents the design and implementation of Lithe, a low-level substrate that provides the basic primitives and a standard interface for composing parallel codes efficiently. Lithe can be inserted underneath the runtimes of legacy parallel libraries to provide bolt-on composability without changes to existing application code. Lithe can also serve as the foundation for building new parallel abstractions and libraries that automatically interoperate with one another. In this paper, we show that versions of Threading Building Blocks (TBB) and OpenMP ported to Lithe perform competitively with their original implementations. Furthermore, for two applications composed of multiple parallel libraries, we show that versions built on our substrate outperform the original implementations, even expertly tuned ones.
