Applying the roofline model
Georg Ofenbeck | Daniele G. Spampinato | Ruedi Steinmann | Victoria Caparrós Cabezas | Markus Püschel