Programming many‐core architectures ‐ a case study: dense matrix computations on the Intel single‐chip cloud computer processor

A message passing, distributed‐memory parallel computer on a chip is one possible design for future, many‐core architectures. We discuss initial experiences with the Intel Single‐chip Cloud Computer research processor, which is a prototype architecture that incorporates 48 cores on a single die that can communicate via a small, shared, on‐die buffer. The experiment is to port a state‐of‐the‐art, distributed‐memory, dense matrix library, Elemental, to this architecture and gain insight from the experience. We show that programmability addressed by this library, especially the proper abstraction for collective communication, greatly aids the porting effort. This enables us to support a wide range of functionality with limited changes to the library code. Copyright © 2011 John Wiley & Sons, Ltd.

[1]  Sergei Gorlatch,et al.  Send-receive considered harmful: Myths and realities of message passing , 2004, TOPL.

[2]  Jack J. Dongarra,et al.  A set of level 3 basic linear algebra subprograms , 1990, TOMS.

[3]  Tom Shanley,et al.  Pentium Processor System Architecture , 1993 .

[4]  Jesper Larsson Träff,et al.  The Hierarchical Factor Algorithm for All-to-All Communication (Research Note) , 2002, Euro-Par.

[5]  Charles L. Lawson,et al.  Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.

[6]  Robert Schreiber,et al.  Scalability of Sparse Direct Solvers , 1993 .

[7]  Robert A. van de Geijn,et al.  Elemental: A New Framework for Distributed Memory Dense Matrix Computations , 2013, TOMS.

[8]  Saurabh Dighe,et al.  The 48-core SCC Processor: the Programmer's View , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[9]  Robert A. van de Geijn,et al.  Collective communication: theory, practice, and experience: Research Articles , 2007 .

[10]  Robert A. van de Geijn,et al.  The science of deriving dense linear algebra algorithms , 2005, TOMS.

[11]  Jack Dongarra,et al.  ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers , 1992, [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation.

[12]  Robert A. van de Geijn,et al.  Two Dimensional Basic Linear Algebra Communication Subprograms , 1993, PPSC.

[13]  Jan Mayer,et al.  A numerical evaluation of preprocessing and ILU-type preconditioners for the solution of unsymmetric sparse linear systems using iterative methods , 2009, TOMS.

[14]  Robert A. van de Geijn,et al.  Representing linear algebra algorithms in code: the FLAME application program interfaces , 2005, TOMS.

[15]  G. W. Stewart Communication and matrix computations on large message passing systems , 1990, Parallel Comput..

[16]  Robert A. van de Geijn,et al.  Collective communication: theory, practice, and experience , 2007, Concurr. Comput. Pract. Exp..

[17]  Francisco D. Igual,et al.  Solving Linear Algebra Problems on Distributed-Memory Computers using Serial Codes , 2010 .

[18]  Timothy G. Mattson,et al.  Light-weight communications on Intel's single-chip cloud computer processor , 2011, OPSR.

[19]  Miodrag Potkonjak,et al.  Optimizing power using transformations , 1995, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[20]  Robert A. van de Geijn,et al.  FLAME: Formal Linear Algebra Methods Environment , 2001, TOMS.

[21]  Jack J. Dongarra,et al.  The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..

[22]  S Quintana-OrtíEnrique,et al.  Programming matrix algorithms-by-blocks for thread-level parallelism , 2009 .

[23]  Bruce Hendrickson,et al.  The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers , 1994, SIAM J. Sci. Comput..

[24]  Robert A. van de Geijn,et al.  Using PLAPACK - parallel linear algebra package , 1997 .

[25]  Jack J. Dongarra,et al.  An extended set of FORTRAN basic linear algebra subprograms , 1988, TOMS.

[26]  Timothy Mattson,et al.  A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).