Paper information - Performance evaluation of concurrent collections on high-performance multicore computing systems

This paper is the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems: (i) a recently proposed asynchronous-parallel Cholesky factorization algorithm, and (ii) a novel and non-trivial “higher-level” partly-asynchronous generalized eigensolver for dense symmetric matrices. Given a well-tuned sequential BLAS, our implementations match competing multithreaded vendor-tuned codes or exceed them by up to 2.6×. Our evaluation compares with alternative models, including ScaLAPACK with a shared-memory MPI, OpenMP, Cilk++, and PLASMA 2.0, on Intel Harpertown, Nehalem, and AMD Barcelona systems. Looking forward, we identify new opportunities to improve the CnC language and runtime scheduling and execution.
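To make the idea of operations "partially ordered by scheduling constraints" concrete, the following is a minimal Python sketch of a Cholesky factorization written as steps that fire when their inputs exist. It is not the CnC API and not the paper's implementation: the function name `cholesky_dataflow`, the item keys, and the wave-by-wave scheduler are all assumptions made for illustration. To keep it short, each "tile" is a single scalar, and the steps run sequentially in an arbitrary dependency-respecting order rather than in parallel.

```python
import math
import random


def cholesky_dataflow(a, seed=0):
    """Factor a symmetric positive-definite matrix as L * L^T, running
    steps in any order that respects their data dependencies.
    (Hypothetical illustration; scalar entries stand in for tiles.)"""
    n = len(a)
    # Single-assignment item store: ("A", i, j, k) is entry (i, j) after the
    # updates from columns 0..k-1; ("L", i, j) is a finished entry of L.
    items = {("A", i, j, 0): a[i][j] for i in range(n) for j in range(i + 1)}

    # Each step is (inputs, output, function); it may run once all inputs exist.
    steps = []
    for k in range(n):
        # Factor the diagonal entry.
        steps.append(([("A", k, k, k)], ("L", k, k), math.sqrt))
        for i in range(k + 1, n):
            # Solve for the entries below the diagonal in column k.
            steps.append(([("A", i, k, k), ("L", k, k)], ("L", i, k),
                          lambda x, d: x / d))
            for j in range(k + 1, i + 1):
                # Update the trailing entries with column k.
                steps.append(([("A", i, j, k), ("L", i, k), ("L", j, k)],
                              ("A", i, j, k + 1),
                              lambda x, li, lj: x - li * lj))

    random.Random(seed).shuffle(steps)  # no program order is assumed
    while steps:
        # Run every step whose inputs are already in the item store.
        ready = [s for s in steps if all(key in items for key in s[0])]
        for inputs, output, fn in ready:
            items[output] = fn(*(items[key] for key in inputs))
        steps = [s for s in steps if s[1] not in items]

    return [[items[("L", i, j)] if j <= i else 0.0 for j in range(n)]
            for i in range(n)]
```

Because each step names only the data it consumes and produces, shuffling the step list with a different `seed` changes the execution order but not the resulting factor. A parallel runtime could exploit the same freedom by running all ready steps of a wave concurrently.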
