Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu cluster systems. (Ordonnancement et optimisations mémoire pour un solveur creux par méthodes directes sur des machines hétérogènes)
[1] Cleve Ashcraft. The fan-both family of column-based distributed Cholesky factorization algorithms , 1993 .
[2] Jennifer A. Scott,et al. Design of a Multicore Sparse Cholesky Factorization Using DAGs , 2010, SIAM J. Sci. Comput..
[3] Mathieu Faverge,et al. Ordonnancement hybride statique-dynamique en algèbre linéaire creuse pour de grands clusters de machines NUMA et multi-coeurs , 2009 .
[4] Pierre Ramet,et al. A NUMA Aware Scheduler for a Parallel Sparse Direct Solver , 2009 .
[5] Robert W. Numrich,et al. Co-array Fortran for parallel programming , 1998, FORF.
[6] Julien Langou,et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..
[7] James Demmel,et al. A Supernodal Approach to Sparse Partial Pivoting , 1999, SIAM J. Matrix Anal. Appl..
[8] J. Roman,et al. On finding approximate supernodes for an efficient ILU(k) factorization , 2006 .
[9] Pierre Ramet,et al. Fine Grain Scheduling for Sparse Solver on Manycore Architectures , 2012 .
[10] Xiaoye S. Li. Evaluation of Sparse LU Factorization and Triangular Solution on Multicore Platforms , 2008, VECPAR.
[11] Pascal Hénon,et al. PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Definite Systems , 2000 .
[12] Julien Langou,et al. Algorithm 842: A set of GMRES routines for real and complex arithmetics on high performance computers , 2005, TOMS.
[13] George Bosilca,et al. Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach , 2012 .
[14] Ichitaro Yamazaki. PDSLin User Guide , 2011 .
[15] James Demmel,et al. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems , 2003, TOMS.
[16] Yousef Saad,et al. ILUT: A dual threshold incomplete LU factorization , 1994, Numer. Linear Algebra Appl..
[17] Jack J. Dongarra,et al. An Improved Magma Gemm For Fermi Graphics Processing Units , 2010, Int. J. High Perform. Comput. Appl..
[18] William Gropp,et al. MPICH2: A New Start for MPI Implementations , 2002, PVM/MPI.
[19] Iain S. Duff,et al. Sparse system solution and the HSL Library , 2006 .
[20] John E. Stone,et al. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.
[21] Mario Ricchiuto,et al. Comparison of high order algorithms in Aerosol and Aghora for compressible flows , 2013 .
[22] Christina Freytag,et al. Using Mpi Portable Parallel Programming With The Message Passing Interface , 2016 .
[23] George Bosilca,et al. Toward a supernodal sparse direct solver over DAG runtimes , 2012 .
[24] Timothy A. Davis,et al. Algorithm 832: UMFPACK V4.3---an unsymmetric-pattern multifrontal method , 2004, TOMS.
[25] Brice Goglin,et al. Dynamic Task and Data Placement over NUMA Architectures: An OpenMP Runtime Perspective , 2009, IWOMP.
[26] Vipin Kumar,et al. WSSMP: A High-Performance Shared- and Distributed-Memory Parallel Sparse Symmetric Linear Equation Solver , 2007 .
[27] Olivier Czarny,et al. Bézier surfaces and finite elements for MHD simulations , 2008, J. Comput. Phys..
[28] Anshul Gupta,et al. Recent Progress in General Sparse Direct Solvers , 2001, International Conference on Computational Science.
[29] Jack J. Dongarra,et al. Accelerating GPU Kernels for Dense Linear Algebra , 2010, VECPAR.
[30] Anamitra R. Choudhury,et al. Multifrontal Factorization of Sparse SPD Matrices on GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[31] Patrick R. Amestoy,et al. Multifrontal parallel distributed symmetric and unsymmetric solvers , 2000 .
[32] Iain S. Duff. Sparse system solution and the HSL Library , 2006 .
[33] Vladimir Volokhov,et al. Parallel geometric multigrid , 2016, Int. J. Comput. Sci. Math..
[34] Emmanuel Jeannot,et al. Compact DAG representation and its symbolic scheduling , 1999, J. Parallel Distributed Comput..
[35] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[36] Pierre Ramet,et al. Sparse direct solver on top of large-scale multicore systems with GPU accelerators , 2012 .
[37] G.E. Moore,et al. Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.
[38] Anoop Gupta,et al. An efficient block-oriented approach to parallel sparse Cholesky factorization , 1993, Supercomputing '93. Proceedings.
[39] J. Hogg. High performance Cholesky and symmetric indefinite factorizations with applications , 2010 .
[40] James Demmel,et al. Parallel Symbolic Factorization for Sparse LU with Static Pivoting , 2007, SIAM J. Sci. Comput..
[41] Jack J. Dongarra,et al. A set of level 3 basic linear algebra subprograms , 1990, TOMS.
[42] Anshul Gupta. A Shared- and distributed-memory parallel general sparse direct solver , 2007, Applicable Algebra in Engineering, Communication and Computing.
[43] Herb Sutter,et al. The Free Lunch Is Over A Fundamental Turn Toward Concurrency in Software , 2013 .
[44] Robert A. van de Geijn,et al. High performance dense linear algebra on a spatially distributed processor , 2008, PPoPP.
[45] George Bosilca,et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.
[46] Al Geist,et al. Task scheduling for parallel sparse Cholesky factorization , 1990, International Journal of Parallel Programming.
[47] Mahesh V. Joshi,et al. PSPASES: Scalable Parallel Direct Solver Library for Sparse Symmetric Positive Definite Linear Systems , 1999 .
[48] Jennifer A. Scott,et al. New Parallel Sparse Direct Solvers for Multicore Architectures , 2013, Algorithms.
[49] James Demmel,et al. LAPACK Users' Guide, Third Edition , 1999, Software, Environments and Tools.
[50] Jan Westerholm,et al. Efficient Assembly of Sparse Matrices Using Hashing , 2006, PARA.
[51] Cédric Augonnet,et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..
[52] Bo Kågström,et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark , 1998, TOMS.
[53] James Reinders,et al. Intel Xeon Phi Coprocessor High Performance Programming , 2013 .
[54] James Demmel,et al. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance , 1995, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.
[55] Michael Klemm,et al. OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison , 2012, MARC@RWTH.
[56] Dirk Eddelbuettel,et al. Benchmarking Single- and Multi-Core BLAS Implementations and GPUs for use with R , 2010 .
[57] Jack Dongarra,et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .
[58] G. Huysmans,et al. MHD stability in X-point geometry: simulation of ELMs , 2007 .
[59] Hans Werner Meuer,et al. Top500 Supercomputer Sites , 1997 .
[60] Victor Eijkhout,et al. A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling , 2014, ACM Trans. Math. Softw..
[61] Yanqing Chen,et al. Algorithm 8xx: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate , 2006 .
[62] James Reinders,et al. Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .
[63] Pierre Ramet,et al. A task-based sparse direct solver suited for large scale hierarchical/heterogeneous architectures , 2015 .
[64] Pascal Hénon,et al. A Parallel Direct/Iterative Solver Based on a Schur Complement Approach , 2008, 2008 11th IEEE International Conference on Computational Science and Engineering.
[65] P. Charrier,et al. Algorithmique et calculs de complexité pour un solveur de type dissections emboîtées , 1989 .
[66] Barry W. Peyton,et al. Block sparse Cholesky algorithms on advanced uniprocessor computers , 1991 .
[67] Edward Rothberg. Performance of Panel and Block Approaches to Sparse Cholesky Factorization on the iPSC/860 and Paragon Multicomputers , 1996, SIAM J. Sci. Comput..
[68] Ninghui Sun,et al. Fast implementation of DGEMM on Fermi GPU , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[69] Olaf Schenk,et al. Solving unsymmetric sparse systems of linear equations with PARDISO , 2002, Future Gener. Comput. Syst..
[70] Jack J. Dongarra,et al. An extended set of FORTRAN basic linear algebra subprograms , 1988, TOMS.
[71] Richard W. Vuduc,et al. A Distributed CPU-GPU Sparse Direct Solver , 2014, Euro-Par.
[72] Jack Dongarra,et al. Sparse direct solvers with accelerators over DAG runtimes , 2012 .
[73] Thomas Hérault,et al. Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[74] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.
[75] Wolfgang Hackbusch,et al. Multi-grid methods and applications , 1985, Springer series in computational mathematics.
[76] Murat Efe Guney,et al. On the limits of GPU acceleration , 2010 .
[77] Pierre Ramet,et al. Dynamic scheduling for sparse direct solver on NUMA architectures , 2008 .
[78] Alfredo Buttari,et al. Fine-Grained Multithreading for the Multifrontal QR Factorization of Sparse Matrices , 2013, SIAM J. Sci. Comput..
[79] Jesús Labarta,et al. Parallelizing dense and banded linear algebra libraries using SMPSs , 2009, Concurr. Comput. Pract. Exp..
[80] Jack J. Dongarra,et al. EZTrace: A Generic Framework for Performance Analysis , 2011, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[81] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.
[82] Thomas Hérault,et al. DAGuE: A Generic Distributed DAG Engine for High Performance Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[83] Robert A. van de Geijn,et al. SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks , 2008, PPoPP.
[84] Julien Langou,et al. The Impact of Multicore on Math Software , 2006, PARA.
[85] Pascal Hénon,et al. PaStiX: a high-performance parallel direct solver for sparse symmetric positive definite systems , 2002, Parallel Comput..
[86] Laxmikant V. Kalé,et al. CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.
[87] Patrick R. Amestoy,et al. An Approximate Minimum Degree Ordering Algorithm , 1996, SIAM J. Matrix Anal. Appl..
[88] Guillaume Mercier,et al. hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.
[89] Emmanuel Jeannot,et al. Compact DAG Representation and Its Dynamic Scheduling , 1999, J. Parallel Distributed Comput..
[90] Jérémie Allard,et al. Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations , 2010, Euro-Par.
[91] Xavier Lacoste. Work stealing and granularity optimizations for a sparse solver on manycores , 2013 .
[92] Helmar Burkhart,et al. General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform , 2007 .
[93] Ümit V. Çatalyürek,et al. Improving performance of adaptive component-based dataflow middleware , 2012, Parallel Comput..
[94] Hyesoon Kim,et al. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[95] Yousef Saad,et al. Iterative methods for sparse linear systems , 2003 .
[96] Ichitaro Yamazaki,et al. New Scheduling Strategies and Hybrid Programming for a Parallel Right-looking Sparse LU Factorization Algorithm on Multicore Cluster Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[97] A. Brandt. Algebraic multigrid theory: The symmetric case , 1986 .
[98] Timothy A. Davis,et al. Direct methods for sparse linear systems , 2006, Fundamentals of algorithms.
[99] J. Pasciak,et al. Computer solution of large sparse positive definite systems , 1982 .
[100] Asim YarKhan,et al. Dynamic Task Execution on Shared and Distributed Memory Architectures , 2012 .
[101] Robert E. Tarjan,et al. Algorithmic Aspects of Vertex Elimination on Graphs , 1976, SIAM J. Comput..
[102] James Demmel,et al. Making Sparse Gaussian Elimination Scalable by Static Pivoting , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[103] Jack Dongarra,et al. Fully Dynamic Scheduler for Numerical Computing on Multicore Processors , 2009 .
[104] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[105] Pascal Hénon. Distribution des données et régulation statique des calculs et des communications pour la résolution de grands systèmes linéaires creux par méthode directe , 2001 .
[106] Jack J. Dongarra,et al. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[107] Rohit Chandra,et al. Parallel programming in openMP , 2000 .
[108] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[109] Timothy A. Davis,et al. Accelerating sparse cholesky factorization on GPUs , 2014, IA3 '14.
[110] Azzam Haidar,et al. Parallel algebraic hybrid solvers for large 3D convection-diffusion problems , 2008, Numerical Algorithms.
[111] Patrick Amestoy,et al. A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling , 2001, SIAM J. Matrix Anal. Appl..
[112] Joseph W. H. Liu. The role of elimination trees in sparse factorization , 1990 .
[113] Joseph W. H. Liu,et al. A Comparison of Three Column-Based Distributed Sparse Factorization Schemes. , 1990 .
[114] Pascal Hénon,et al. On finding approximate supernodes for an efficient block-ILU(k) factorization , 2008, Parallel Comput..
[115] Cédric Augonnet,et al. Scheduling Tasks over Multicore machines enhanced with accelerators: a Runtime System's Perspective , 2011 .
[116] Harvey Richardson,et al. High Performance Fortran: history, overview and current developments , 1996 .
[117] Stanimire Tomov,et al. One-sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators , 2012, ICCS.
[118] Roger Grimes,et al. Multifrontal Computations on GPUs and Their Multi-core Hosts , 2010, VECPAR.
[119] Helmar Burkhart,et al. Algorithmic performance studies on graphics processing units , 2008, J. Parallel Distributed Comput..
[120] Chenhan D. Yu,et al. A CPU-GPU hybrid approach for the unsymmetric multifrontal method , 2011, Parallel Comput..
[121] Jennifer A. Scott,et al. A Sparse Symmetric Indefinite Direct Solver for GPU Architectures , 2016, ACM Trans. Math. Softw..
[122] George Bosilca,et al. Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.
[123] Jean Roman,et al. Sparse Matrix Ordering with SCOTCH , 1997, HPCN Europe.
[124] Eduard Ayguadé,et al. Hierarchical Task-Based Programming With StarSs , 2009, Int. J. High Perform. Comput. Appl..
[125] Charles L. Lawson,et al. Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.
[126] Victor Eijkhout,et al. Sparse direct factorizations through unassembled hyper-matrices , 2010 .
[127] Vipin Kumar,et al. Highly Scalable Parallel Algorithms for Sparse Matrix Factorization , 1997, IEEE Trans. Parallel Distributed Syst..
[128] Robert Schreiber,et al. Scalability of Sparse Direct Solvers , 1993 .
[129] T. Manteuffel. An incomplete factorization technique for positive definite linear systems , 1980 .
[130] Jack J. Dongarra,et al. Autotuning GEMM Kernels for the Fermi GPU , 2012, IEEE Transactions on Parallel and Distributed Systems.
[131] Sivasankaran Rajamanickam,et al. ShyLU: A Hybrid-Hybrid Solver for Multicore Platforms , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[132] Michael T. Heath,et al. Parallel Algorithms for Sparse Linear Systems , 1991, SIAM Rev..
[133] Laxmikant V. Kalé,et al. Programming heterogeneous clusters with accelerators using object-based programming , 2011, Sci. Program..
[134] Robert A. van de Geijn,et al. The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations , 2012, J. Parallel Distributed Comput..
[135] J. W. Walker,et al. Direct solutions of sparse network equations by optimally ordered triangular factorization , 1967 .
[136] Jesús Labarta,et al. CellSs: Scheduling techniques to better exploit memory hierarchy , 2009, Sci. Program..
[137] E. Cuthill,et al. Reducing the bandwidth of sparse symmetric matrices , 1969, ACM '69.
[138] Gene Poole,et al. Accelerating the ANSYS Direct Sparse Solver with GPUs , 2011 .
[139] Katherine Yelick,et al. Introduction to UPC and Language Specification , 2000 .
[140] Robert Schreiber,et al. Improved load distribution in parallel sparse Cholesky factorization , 1994, Proceedings of Supercomputing '94.
[141] Eduard Ayguadé,et al. An Extension of the StarSs Programming Model for Platforms with Multiple GPUs , 2009, Euro-Par.
[142] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[143] Cleve Ashcraft,et al. A Fan-In Algorithm for Distributed Sparse Numerical Factorization , 1990, SIAM J. Sci. Comput..
[144] Patrick Amestoy,et al. Hybridizing Nested Dissection and Halo Approximate Minimum Degree for Efficient Sparse Matrix Ordering , 1999, Concurr. Pract. Exp..
[145] Matteo Frigo,et al. The implementation of the Cilk-5 multithreaded language , 1998, PLDI.
[146] Jean-Yves L'Excellent,et al. Introduction of shared-memory parallelism in a distributed-memory multifrontal solver , 2013 .
[147] Jaeyoung Choi,et al. A Proposal for a Set of Parallel Basic Linear Algebra Subprograms , 1995, PARA.