Exposing Tunable Parameters in Multi-threaded Numerical Code
Apan Qasem | Jichi Guo | Faizur Rahman | Qing Yi
[1] Samuel Williams, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[2] Ken Kennedy, et al. Improving effective bandwidth through compiler enhancement of global cache reuse, 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[3] William Thies, et al. A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[4] Chun Chen, et al. Loop Transformation Recipes for Code Generation and Auto-Tuning, 2009, LCPC.
[5] Kathryn S. McKinley, et al. Tile size selection using cache organization and data layout, 1995, PLDI '95.
[6] Matteo Frigo. A Fast Fourier Transform Compiler, 1999, PLDI.
[7] Stephen F. Jenks, et al. The Synchronized Pipelined Parallelism Model, 2004.
[8] Nathan R. Tallent, et al. HPCTOOLKIT: tools for performance analysis of optimized parallel programs, 2010, Concurr. Comput. Pract. Exp.
[9] Richard W. Vuduc, et al. POET: Parameterized Optimizations for Empirical Tuning, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[10] Ken Kennedy, et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach, 2001.
[11] Kostas Papadopoulos, et al. HelperCore_DB: Exploiting Multicore Technology for Databases, 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).
[12] R. C. Whaley, et al. Automated transformation for performance-critical kernels, 2007, LCSD '07.
[13] Michael E. Wolf, et al. Combining Loop Transformations Considering Caches and Scheduling, 2004, International Journal of Parallel Programming.
[14] Ken Kennedy, et al. Profitable loop fusion and tiling using model-driven empirical search, 2006, ICS '06.
[15] Allen, et al. Optimizing Compilers for Modern Architectures, 2004.
[16] Franz Franchetti, et al. Discrete Fourier transform on multicore, 2009, IEEE Signal Processing Magazine.
[17] Monica S. Lam, et al. A data locality optimizing algorithm, 1991, PLDI '91.
[18] Uday Bondhugula, et al. Effective automatic parallelization of stencil computations, 2007, PLDI '07.
[19] Xipeng Shen, et al. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?, 2010, PPoPP '10.
[20] Jack J. Dongarra, et al. Feedback-directed thread scheduling with memory considerations, 2007, HPDC '07.
[21] L. Almagor, et al. Finding effective compilation sequences, 2004, LCTES '04.
[22] Jack J. Dongarra, et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.
[23] David G. Wonnacott, et al. Using time skewing to eliminate idle time due to memory bandwidth and network limitations, 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[24] Hsien-Hsin S. Lee, et al. Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[25] Jack Dongarra, et al. Parallel tiled QR factorization for multicore architectures, 2008.