A methodology for speeding up matrix vector multiplication for single/multi-core architectures
Constantinos E. Goutis | Vasilios I. Kelefouras | Angeliki Kritikakou | Elissavet Papadima
[1] Sameer Kulkarni,et al. An evaluation of different modeling techniques for iterative compilation , 2011, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).
[2] M. Kaufmann,et al. Algorithms for SMP-clusters dense matrix-vector multiplication , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.
[3] Stefano Crespi-Reghizzi,et al. Continuous learning of compiler heuristics , 2013, TACO.
[4] Saman P. Amarasinghe,et al. Meta optimization: improving compiler heuristics with machine learning , 2003, PLDI '03.
[5] Samuel Williams,et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[6] Hamid R. Arabnia,et al. A Parallel Algorithm for the Arbitrary Rotation of Digitized Images Using Process-and-Data-Decomposition Approach , 1990, J. Parallel Distributed Comput..
[7] Steven G. Johnson,et al. The Fastest Fourier Transform in the West , 1997 .
[8] Shlomit S. Pinter,et al. Register allocation with instruction scheduling: a new approach , 1996, Journal of Programming Languages.
[9] François Bodin,et al. A Machine Learning Approach to Automatic Production of Compiler Heuristics , 2002, AIMSA.
[10] Nan Zhang. A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication , 2012, IEEE Transactions on Parallel and Distributed Systems.
[11] Keith D. Cooper,et al. Adaptive Optimizing Compilers for the 21st Century , 2002, The Journal of Supercomputing.
[12] Hamid R. Arabnia,et al. Parallel stereocorrelation on a reconfigurable multi-ring network , 1996, The Journal of Supercomputing.
[13] Franz Franchetti,et al. Computer Generation of Hardware for Linear Digital Signal Processing Transforms , 2012, TODE.
[14] Konstantinos G. Margaritis,et al. Performance Models for Matrix Computations on Multicore Processors Using OpenMP , 2010, 2010 International Conference on Parallel and Distributed Computing, Applications and Technologies.
[15] David I. August,et al. Compiler optimization-space exploration , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..
[16] Prasad A. Kulkarni,et al. Practical exhaustive optimization phase order exploration and evaluation , 2009 .
[17] Gary S. Tyson,et al. Practical exhaustive optimization phase order exploration and evaluation , 2009, TACO.
[18] Douglas L. Jones,et al. Fast searches for effective optimization phase sequences , 2004, PLDI '04.
[19] Hamid R. Arabnia,et al. Parallel Edge-Region-Based Segmentation Algorithm Targeted at Reconfigurable MultiRing Network , 2003, The Journal of Supercomputing.
[20] Hamid R. Arabnia,et al. The REFINE Multiprocessor - Theoretical Properties and Algorithms , 1995, Parallel Comput..
[21] H.R. Arabnia,et al. A Transputer Network for Fast Operations on Digitised Images , 1989, Comput. Graph. Forum.
[22] Nicholas Nethercote,et al. Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.
[23] Hamid R. Arabnia,et al. A distributed stereocorrelation algorithm , 1995, Proceedings of Fourth International Conference on Computer Communications and Networks - IC3N'95.
[24] Noriyuki Fujimoto. Dense Matrix-Vector Multiplication on the CUDA Architecture , 2008, Parallel Process. Lett..
[25] Hamid R. Arabnia,et al. A Transputer Network for the Arbitrary Rotation of Digitised Images , 1987, Comput. J..
[26] David F. Bacon,et al. Compiler transformations for high-performance computing , 1994, CSUR.
[27] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[28] Hamid R. Arabnia,et al. Arbitrary Rotation of Raster Images with SIMD Machine Architectures , 1987, Comput. Graph. Forum.
[29] Michael F. P. O'Boyle,et al. Using machine learning to focus iterative optimization , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[30] N. Fujimoto,et al. Faster matrix-vector multiplication on GeForce 8800GTX , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[31] Martin Schulz,et al. On the Performance of an Algebraic Multigrid Solver on Multicore Clusters , 2010, VECPAR.
[32] Alexander Krivutsenko. GotoBLAS - Anatomy of a fast matrix multiplication High performance libraries in computational science , 2008 .
[33] E. Granston,et al. Automatic Recommendation of Compiler Options , 2001 .
[34] Todd M. Austin,et al. The SimpleScalar tool set, version 2.0 , 1997, CARN.
[35] Michael F. P. O'Boyle,et al. A Feasibility Study in Iterative Compilation , 1999, ISHPC.
[36] Hans Henrik Brandenborg Sørensen,et al. High-Performance Matrix-Vector Multiplication on the GPU , 2011, Euro-Par Workshops.
[37] Nectarios Koziris,et al. Performance evaluation of the sparse matrix-vector multiplication on modern architectures , 2009, The Journal of Supercomputing.
[38] R. C. Whaley,et al. Minimizing development and maintenance costs in supporting persistently optimized BLAS , 2005, Softw. Pract. Exp..
[39] S.M. Bhandarkar,et al. The Hough Transform on a Reconfigurable Multi-Ring Network , 1995, J. Parallel Distributed Comput..
[40] Steven J. Plimpton,et al. An Efficient Parallel Algorithm for Matrix-Vector Multiplication , 1995, Int. J. High Speed Comput..
[41] Hamid R. Arabnia,et al. A Reconfigurable Architecture for Image Processing and Computer Vision , 1995, Int. J. Pattern Recognit. Artif. Intell..
[42] Antoine Petitet,et al. Minimizing development and maintenance costs in supporting persistently optimized BLAS , 2005 .
[43] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[44] Vivek N. Waghmare,et al. Performance Analysis of Matrix-Vector Multiplication in Hybrid (MPI + OpenMP) , 2011 .
[45] Ghassan Shobaki,et al. Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach , 2013, ACM Trans. Archit. Code Optim..
[46] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.