Data layout optimization techniques for modern and emerging architectures
[1] Martin F. Berman. A Method for Transposing a Matrix, 1958, JACM.
[2] Todd C. Mowry, et al. Compiler-directed page coloring for multiprocessors, 1996, ASPLOS VII.
[3] Robert J. Harrison, et al. Global Arrays: a portable "shared-memory" programming model for distributed memory computers, 1994, Proceedings of Supercomputing '94.
[4] Jichuan Chang, et al. Cooperative cache partitioning for chip multiprocessors, 2007, ICS '07.
[5] Antonio Gonzalez, et al. A data cache with multiple caching strategies tuned to different types of locality, 1995, International Conference on Supercomputing.
[6] Kunle Olukotun, et al. Niagara: a 32-way multithreaded Sparc processor, 2005, IEEE Micro.
[7] Steven G. Johnson, et al. FFTW: an adaptive software architecture for the FFT, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).
[8] G. G. Stokes. "J.", 1890, The New Yale Book of Quotations.
[9] Wen-mei W. Hwu, et al. Run-Time Cache Bypassing, 1999, IEEE Trans. Computers.
[10] Bharadwaj S. Amrutur, et al. Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[11] Monica S. Lam, et al. A data locality optimizing algorithm (with retrospective), 1991.
[12] Monica S. Lam, et al. An affine partitioning algorithm to maximize parallelism and minimize communication, 1999, ICS '99.
[13] Mahmut T. Kandemir, et al. A Layout-Conscious Iteration Space Transformation Technique, 2001, IEEE Trans. Computers.
[14] Lynn Elliot Cannon, et al. A cellular computer to implement the kalman filter algorithm, 1969.
[15] David A. Padua, et al. Estimating cache misses and locality using stack distances, 2003, ICS '03.
[16] Balaram Sinharoy, et al. POWER4 system microarchitecture, 2002, IBM J. Res. Dev.
[17] Mithuna Thottethodi, et al. Recursive array layouts and fast parallel matrix multiplication, 1999, SPAA '99.
[18] Gang Ren, et al. A comparison of empirical and model-driven optimization, 2003, PLDI '03.
[19] David A. Wood, et al. Managing Wire Delay in Large Chip-Multiprocessor Caches, 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[20] Ken Kennedy, et al. Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries, 2001, J. Parallel Distributed Comput.
[21] Ken Kennedy, et al. Automatic decomposition of scientific programs for parallel execution, 1987, POPL '87.
[22] Jack J. Dongarra, et al. A proposal for a set of level 3 basic linear algebra subprograms, 1987, SGNM.
[23] Peter Kogge, et al. Generation of permutations for SIMD processors, 2005, LCTES '05.
[24] Monica S. Lam, et al. Data and computation transformations for multiprocessors, 1995, PPOPP '95.
[25] Peter Davies, et al. The TLB slice-a low-cost high-speed address translation mechanism, 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.
[26] Robert A. van de Geijn, et al. SUMMA: scalable universal matrix multiplication algorithm, 1995, Concurr. Pract. Exp.
[27] Yale N. Patt, et al. Utility-Based Cache Partitioning, 2006.
[28] Larry Carter, et al. Towards an optimal bit-reversal permutation program, 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280).
[29] Fred G. Gustavson, et al. LAWRA: Linear Algebra with Recursive Algorithms, 2000, PARA.
[30] Mahmut T. Kandemir, et al. Organizing the last line of defense before hitting the memory wall for CMPs, 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).
[31] Chau-Wen Tseng, et al. Data transformations for eliminating conflict misses, 1998, PLDI.
[32] Srihari Makineni, et al. Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource, 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[33] Mithuna Thottethodi, et al. Nonlinear array layouts for hierarchical memory systems, 1999, ICS '99.
[34] Chau-Wen Tseng, et al. Eliminating conflict misses for high performance architectures, 1998, ICS '98.
[35] Anoop Gupta, et al. Scheduling and page migration for multiprocessor compute servers, 1994, ASPLOS VI.
[36] Mahmut T. Kandemir, et al. Improving locality using loop and data transformations in an integrated framework, 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.
[37] Albert Cohen, et al. Iterative optimization in the polyhedral model: part ii, multidimensional time, 2008, PLDI '08.
[38] Yan Solihin, et al. Fair cache sharing and partitioning in a chip multiprocessor architecture, 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.
[39] Jaehyuk Huh, et al. A NUCA substrate for flexible CMP cache sharing, 2005, ICS.
[40] William Pugh, et al. Minimizing communication while preserving parallelism, 1996, ICS '96.
[41] Scott A. Mahlke, et al. Compiler-managed partitioned data caches for low power, 2007, LCTES '07.
[42] James R. Goodman, et al. Memory Bandwidth Limitations of Future Microprocessors, 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[43] Aamer Jaleel, et al. Adaptive insertion policies for high performance caching, 2007, ISCA '07.
[44] Margo I. Seltzer, et al. Operating system benchmarking in the wake of lmbench: a case study of the performance of NetBSD on the Intel x86 architecture, 1997, SIGMETRICS '97.
[45] J. Ramanujam, et al. Global communication optimization for tensor contraction expressions under memory constraints, 2003, Proceedings International Parallel and Distributed Processing Symposium.
[46] Paul Feautrier, et al. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time, 1992, International Journal of Parallel Programming.
[47] Mahmut T. Kandemir, et al. Reducing NoC energy consumption through compiler-directed channel voltage scaling, 2006, PLDI '06.
[48] Sangyeun Cho, et al. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[49] James R. Larus, et al. Making Pointer-Based Data Structures Cache Conscious, 2000, Computer.
[50] Margo I. Seltzer, et al. Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design, 2005, USENIX Annual Technical Conference, General Track.
[51] Doug Burger, et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches, 2002, ASPLOS X.
[52] Arnold L. Rosenberg, et al. Using the compiler to improve cache replacement decisions, 2002, Proceedings. International Conference on Parallel Architectures and Compilation Techniques.
[53] M. Naderi. Think globally..., 2004, HIV prevention plus!.
[54] Chun Chen, et al. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy, 2005, International Symposium on Code Generation and Optimization.
[55] Gang Ren, et al. Optimizing data permutations for SIMD devices, 2006, PLDI '06.
[56] David E. Bernholdt, et al. A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[57] Jack J. Dongarra, et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.
[58] Yannis Smaragdakis, et al. Adaptive Caches: Effective Shaping of Cache Behavior to Workloads, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[59] Susan Laflin, et al. Algorithm 380: in-situ transposition of a rectangular matrix [F1], 1970, CACM.
[60] John R. Gilbert, et al. Automatic array alignment in data-parallel programs, 1993, POPL '93.
[61] Ken Kennedy, et al. Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse, 2000.
[62] Mary W. Hall, et al. Custom data layout for memory parallelism, 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.
[63] Monica S. Lam, et al. Maximizing Parallelism and Minimizing Synchronization with Affine Partitions, 1998, Parallel Comput.
[64] Harish Patil, et al. Pin: building customized program analysis tools with dynamic instrumentation, 2005, PLDI '05.
[65] Wei Li, et al. Unifying data and control transformations for distributed shared-memory machines, 1995, PLDI '95.
[66] Michael F. P. O'Boyle, et al. Non-singular data transformations: definition, validity and applications, 1997, ICS '97.
[67] Rick Kufrin. Measuring and improving application performance with PerfSuite, 2005.
[68] P. Feautrier. Parametric integer programming, 1988.
[69] Larry Carter, et al. Memory hierarchy considerations for fast transpose and bit-reversals, 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[70] Eduard Ayguadé, et al. A case for user-level dynamic page migration, 2000, ICS '00.
[71] Won-Taek Lim, et al. Architectural support for operating system-driven CMP cache management, 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[72] William Pugh, et al. Constraint-based array dependence analysis, 1998, TOPL.
[73] Li Shang, et al. Dynamic voltage scaling with links for power optimization of interconnection networks, 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.
[74] Sarita V. Adve, et al. Code transformations to improve memory parallelism, 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.
[75] Qiang Wu, et al. Exposing memory access regularities using object-relative memory profiling, 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.
[76] R. C. Whaley, et al. Timing high performance kernels through empirical compilation, 2005, 2005 International Conference on Parallel Processing (ICPP'05).
[77] Zhao Zhang, et al. Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors, 2000, SIAM J. Sci. Comput.
[78] Henry G. Dietz, et al. Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation, 1991, LCPC.
[79] Sanjay V. Rajopadhye, et al. Generation of Efficient Nested Loops from Polyhedra, 2000, International Journal of Parallel Programming.
[80] Saman P. Amarasinghe, et al. Maps: a compiler-managed memory system for Raw machines, 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).
[81] Keshav Pingali, et al. Automatic measurement of memory hierarchy parameters, 2005, SIGMETRICS '05.
[82] Jack J. Dongarra, et al. Automated empirical optimizations of software and the ATLAS project, 2001, Parallel Comput.
[83] Gustavo E. Scuseria, et al. Achieving Chemical Accuracy with Coupled-Cluster Theory, 1995.
[84] David Tarditi, et al. Accelerator: using data parallelism to program GPUs for general-purpose uses, 2006, ASPLOS XII.
[85] John Zahorjan, et al. Optimizing Data Locality by Array Restructuring, 1995.
[86] Milo M. K. Martin, et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, 2005, CARN.
[87] Mark D. Hill, et al. Virtual hierarchies to support server consolidation, 2007, ISCA '07.
[88] Chau-Wen Tseng, et al. Improving data locality with loop transformations, 1996, TOPL.
[89] Chris H. Q. Ding, et al. An Optimal Index Reshuffle Algorithm for Multidimensional Arrays and Its Applications for Parallel Architectures, 2001, IEEE Trans. Parallel Distributed Syst.
[90] Mahmut T. Kandemir, et al. Profile-driven energy reduction in network-on-chips, 2007, PLDI '07.
[91] Balaram Sinharoy, et al. IBM Power5 chip: a dual-core multithreaded processor, 2004, IEEE Micro.
[92] Siddhartha Chatterjee, et al. Cache-efficient matrix transposition, 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).
[93] Michael Stumm, et al. Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer, 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.
[94] Uday Bondhugula, et al. A practical automatic polyhedral parallelizer and locality optimizer, 2008, PLDI '08.
[95] Chandra Krintz, et al. Cache-conscious data placement, 1998, ASPLOS VIII.
[96] Chen Ding, et al. Miss Rate Prediction Across Program Inputs and Cache Configurations, 2007, IEEE Transactions on Computers.
[97] Bradford L. Chamberlain, et al. ZPL: A Machine Independent Programming Language for Parallel Computers, 2000, IEEE Trans. Software Eng.