Performance evaluation of concurrent collections on high-performance multicore computing systems

This paper is the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems: (i) a recently proposed asynchronous-parallel Cholesky factorization algorithm, and (ii) a novel and non-trivial “higher-level” partly-asynchronous generalized eigensolver for dense symmetric matrices. Given a well-tuned sequential BLAS, our implementations match or exceed competing multithreaded vendor-tuned codes by up to 2.6×. Our evaluation compares CnC with alternative models, including ScaLAPACK with a shared-memory MPI, OpenMP, Cilk++, and PLASMA 2.0, on Intel Harpertown, Intel Nehalem, and AMD Barcelona systems. Looking forward, we identify new opportunities to improve the CnC language and its runtime scheduling and execution.
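To make the idea of operations ordered only by their data dependences concrete, the following is a minimal sketch of a Cholesky factorization driven by data availability. It is not the CnC API and not the implementation evaluated in the paper: the function name `cholesky_dataflow`, the `(i, j, k)` keys, the step names, and the `seed` parameter are all hypothetical, and the "tiles" are 1×1 scalars so that the sketch needs only the Python standard library. A real tiled version would replace the three scalar operations with BLAS/LAPACK kernels on blocks.

```python
import math
import random


def cholesky_dataflow(a, seed=0):
    """Return lower-triangular L with L * L^T == a, for a symmetric
    positive-definite matrix `a` given as a list of lists.

    Toy illustration only: steps run whenever their inputs exist, in an
    order that depends on `seed`, not in program order.
    """
    n = len(a)
    # Single-assignment store: (i, j, k) is entry (i, j) after k updates.
    items = {(i, j, 0): float(a[i][j]) for i in range(n) for j in range(i + 1)}

    # Each step is (kind, input keys, output key).
    steps = []
    for k in range(n):
        # Diagonal entry: square root of the fully updated value.
        steps.append(("factor", [(k, k, k)], (k, k, k + 1)))
        for i in range(k + 1, n):
            # Column entry: divide by the factored diagonal entry.
            steps.append(("solve", [(i, k, k), (k, k, k + 1)], (i, k, k + 1)))
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                # Trailing entry: subtract the product of two column entries.
                steps.append(
                    ("update",
                     [(i, j, k), (i, k, k + 1), (j, k, k + 1)],
                     (i, j, k + 1))
                )

    # Scramble the steps so that only data availability decides the order.
    pending = list(steps)
    random.Random(seed).shuffle(pending)

    while pending:
        ready = [s for s in pending if all(key in items for key in s[1])]
        if not ready:
            raise RuntimeError("no step is ready; dependence graph is broken")
        for kind, inputs, out in ready:
            vals = [items[key] for key in inputs]
            if kind == "factor":
                items[out] = math.sqrt(vals[0])
            elif kind == "solve":
                items[out] = vals[0] / vals[1]
            else:
                items[out] = vals[0] - vals[1] * vals[2]
        pending = [s for s in pending if s[2] not in items]

    # Entry (i, j) is final once it has passed through step j.
    return [[items[(i, j, j + 1)] if j <= i else 0.0 for j in range(n)]
            for i in range(n)]
```

Calling `cholesky_dataflow([[4, 2, 2], [2, 5, 3], [2, 3, 6]])` yields the same factor for any `seed`, because each value is written once and each step waits for its inputs. That order independence is the property an asynchronous-parallel runtime can exploit to run ready steps concurrently.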
