Automatic tuning of sparse matrix-vector multiplication on multicore clusters
Changjun Hu | Shigang Li | Yunquan Zhang | Junchao Zhang