Scalable Dense Linear Algebra on Heterogeneous Hardware
Thomas Hérault | George Bosilca | Jack J. Dongarra | Stanimire Tomov | Piotr Luszczek | Jakub Kurzak | Anthony Danalis | Aurélien Bouteiller
[1] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[2] R.H. Dennard,et al. Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.
[3] Jack J. Dongarra,et al. A Note on Auto-tuning GEMM for GPUs , 2009, ICCS.
[4] Jack J. Dongarra,et al. Exploiting Fine-Grain Parallelism in Recursive LU Factorization , 2011, PARCO.
[5] Jack Dongarra,et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .
[6] Bradley C. Kuszmaul,et al. Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.
[7] Thomas Hérault,et al. Performance Portability of a GPU Enabled Factorization with the DAGuE Framework , 2011, 2011 IEEE International Conference on Cluster Computing.
[8] Emmanuel Agullo,et al. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[9] Franck Cappello,et al. Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed , 2006, Int. J. High Perform. Comput. Appl..
[10] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[11] Jack J. Dongarra,et al. Scheduling dense linear algebra operations on multicore processors , 2010, Concurr. Comput. Pract. Exp..
[12] Herb Sutter,et al. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software , 2013 .
[13] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.
[14] C. Loan,et al. A Storage-Efficient WY Representation for Products of Householder Transformations , 1989 .
[15] Lars Karlsson,et al. Distributed SBP Cholesky factorization algorithms with near-optimal scheduling , 2009, TOMS.
[16] Jack J. Dongarra,et al. The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..
[17] Julien Langou,et al. Parallel tiled QR factorization for multicore architectures , 2007, Concurr. Comput. Pract. Exp..
[18] Thomas Hérault,et al. Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[19] Ed Anderson,et al. LAPACK Users' Guide , 1995 .
[20] Jack J. Dongarra,et al. Accelerating GPU Kernels for Dense Linear Algebra , 2010, VECPAR.
[21] Jack J. Dongarra,et al. High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures , 2013, TOMS.
[22] Fred G. Gustavson,et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..
[23] Allen D. Malony,et al. The open trace format (OTF) and open tracing for HPC , 2006, SC.
[24] Emmanuel Agullo,et al. Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures , 2010, VECPAR.
[25] Jack J. Dongarra,et al. Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[26] G.E. Moore,et al. Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.
[27] Jack Dongarra,et al. Parallel tiled QR factorization for multicore architectures , 2008 .
[28] Cédric Augonnet,et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..
[29] Nicholas J. Higham,et al. INVERSE PROBLEMS NEWSLETTER , 1991 .
[30] Jack Dongarra,et al. QUARK Users' Guide: QUeueing And Runtime for Kernels , 2011 .
[31] Thomas Hérault,et al. DAGuE: A Generic Distributed DAG Engine for High Performance Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[32] James Reinders,et al. Intel® threading building blocks , 2008 .