Automatic algorithm derivation and exploration in linear algebra for parallelism and locality

[1]  Robert A. van de Geijn, et al.  The science of deriving dense linear algebra algorithms, 2005, TOMS.

[2]  Saman P. Amarasinghe, et al.  Meta optimization: improving compiler heuristics with machine learning, 2003, PLDI '03.

[3]  Julien Langou, et al.  The Impact of Multicore on Math Software, 2006, PARA.

[4]  J. Ramanujam, et al.  Tiling multidimensional iteration spaces for nonshared memory machines, 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[5]  Paolo Bientinesi, et al.  Knowledge-Based Automatic Generation of Partitioned Matrix Expressions, 2011, CASC.

[6]  William Jalby, et al.  Loop Optimization using Hierarchical Compilation and Kernel Decomposition, 2007, International Symposium on Code Generation and Optimization (CGO '07).

[7]  Isak Jonsson, et al.  Recursive blocked algorithms for solving triangular systems, Part I: one-sided and coupled Sylvester-type matrix equations, 2002, TOMS.

[8]  Keshav Pingali, et al.  Data-Centric Transformations for Locality Enhancement, 2001, International Journal of Parallel Programming.

[9]  David A. Padua, et al.  A Parallel Numerical Solver Using Hierarchically Tiled Arrays, 2010, LCPC.

[10]  Monica S. Lam, et al.  A data locality optimizing algorithm, 1991, PLDI '91.

[11]  Fred G. Gustavson, et al.  Recursion leads to automatic variable blocking for dense linear-algebra algorithms, 1997, IBM J. Res. Dev.

[12]  Xing Zhou, et al.  Hierarchical overlapped tiling, 2012, CGO '12.

[13]  Jack Dongarra, et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, 2009.

[14]  Steven G. Johnson, et al.  FFTW: an adaptive software architecture for the FFT, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98).

[15]  Franz Franchetti, et al.  SPIRAL: Code Generation for DSP Transforms, 2005, Proceedings of the IEEE.

[16]  James Reinders, et al.  Intel threading building blocks - outfitting C++ for multi-core processor parallelism, 2007.

[17]  Gang Ren, et al.  Is Search Really Necessary to Generate High-Performance BLAS?, 2005, Proceedings of the IEEE.

[18]  Ahmed H. Sameh, et al.  A parallel hybrid banded system solver: the SPIKE algorithm, 2006, Parallel Comput.