A LIGHTWEIGHT RUN-TIME SUPPORT FOR FAST DENSE LINEAR ALGEBRA ON MULTI-CORE

The work proposes MDF , a lightweight dynamic run-time support able to achieve high performance in the execution of dense linear algebra kernels on shared-cache multi-core. MDF implements a dynamic macro-dataow interpreter processing DAG graphs generated on-the-y out of standard numeric kernel code. The experimental results demonstrate that the performance obtained using MDF on both ne-grain and coarse-grain problems is comparable with or even better than that achieved by de-facto standard solutions (notably PLASMA library), which use separate run-time supports specically optimised for dierent computational grains on modern multi-core.

[1]  Heinrich Meyr,et al.  High level software synthesis for signal processing systems , 1992, [1992] Proceedings of the International Conference on Application Specific Array Processors.

[2]  Jack Dongarra,et al.  Fully Dynamic Scheduler for Numerical Computing on Multicore Processors , 2009 .

[3]  Jesús Labarta,et al.  A dependency-aware task-based programming environment for multi-core architectures , 2008, 2008 IEEE International Conference on Cluster Computing.

[4]  Thomas Hérault,et al.  Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[5]  Arthur H. Veen,et al.  Dataflow machine architecture , 1986, CSUR.

[6]  Jack Dongarra,et al.  Parallel tiled QR factorization for multicore architectures , 2008 .

[7]  Peter Kilpatrick,et al.  Accelerating Code on Multi-cores with FastFlow , 2011, Euro-Par.

[8]  Julien Langou,et al.  A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[9]  Gabriele Mencagli,et al.  EVALUATION OF ARCHITECTURAL SUPPORTS FOR FINE-GRAINED SYNCHRONIZATION MECHANISMS , 2013 .

[10]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[11]  Peter Kilpatrick,et al.  An Efficient Unbounded Lock-Free Queue for Multi-core Systems , 2012, Euro-Par.

[12]  Horacio González-Vélez,et al.  A survey of algorithmic skeleton frameworks: high‐level structured parallel programming enablers , 2010, Softw. Pract. Exp..

[13]  Jack J. Dongarra,et al.  A scalable framework for heterogeneous GPU-based clusters , 2012, SPAA '12.

[14]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[15]  Jack J. Dongarra,et al.  Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization , 2008, IEEE Transactions on Parallel and Distributed Systems.

[16]  Jack Dongarra,et al.  QUARK Users' Guide: QUeueing And Runtime for Kernels , 2011 .

[17]  Thomas Hérault,et al.  DAGuE: A Generic Distributed DAG Engine for High Performance Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[18]  Murray Cole,et al.  Algorithmic Skeletons: Structured Management of Parallel Computation , 1989 .

[19]  Marco Danelutto,et al.  Parallel Patterns for General Purpose Many-Core , 2013, 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.

[20]  Jack J. Dongarra,et al.  Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures , 2011, Concurr. Comput. Pract. Exp..