Co-optimizing memory-level parallelism and cache-level parallelism
[1] Mahmut T. Kandemir,et al. Quantifying and Optimizing Data Access Parallelism on Manycores , 2018, 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).
[2] Mahmut T. Kandemir,et al. Scheduling techniques for GPU architectures with processing-in-memory capabilities , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[3] Gerda Janssens,et al. Multi-dimensional incremental loop fusion for data locality , 2003, Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003.
[4] Mahmut T. Kandemir,et al. Race-To-Sleep + Content Caching + Display Caching: A Recipe for Energy-efficient Video Streaming on Handhelds , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[5] Mahmut T. Kandemir,et al. Data Movement Aware Computation Partitioning , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[6] Onur Mutlu,et al. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[7] Mahmut T. Kandemir,et al. Addressing End-to-End Memory Access Latency in NoC-Based Multicores , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[8] Calvin Lin,et al. Linearizing irregular memory accesses for improved correlated prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9] Mahmut T. Kandemir,et al. POSTER: Location-Aware Computation Mapping for Manycore Processors , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[10] Zhiyuan Li,et al. New tiling techniques to improve cache temporal locality , 1999, PLDI '99.
[11] Mahmut T. Kandemir,et al. Meeting midway: Improving CMP performance with memory-side prefetching , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[12] Onur Mutlu,et al. Improving memory Bank-Level Parallelism in the presence of prefetching , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[13] Calvin Lin,et al. Adaptive History-Based Memory Schedulers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[14] Josep Torrellas,et al. PageForge: A Near-Memory Content-Aware Page-Merging Architecture , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[15] Monica S. Lam,et al. Global optimizations for parallelism and locality on scalable parallel machines , 1993, PLDI '93.
[16] John Zahorjan,et al. Optimizing Data Locality by Array Restructuring , 1995.
[17] Mary W. Hall,et al. Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler , 1995, Proceedings of the IEEE/ACM SC95 Conference.
[18] Paul Feautrier,et al. Some efficient solutions to the affine scheduling problem. I. One-dimensional time , 1992, International Journal of Parallel Programming.
[19] Doug Burger,et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.
[20] Monica S. Lam,et al. A Loop Transformation Theory and an Algorithm to Maximize Parallelism , 1991, IEEE Trans. Parallel Distributed Syst.
[21] Wei Li,et al. Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.
[22] Mahmut T. Kandemir,et al. Optimizing off-chip accesses in multicores , 2015, PLDI.
[23] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.
[24] Michael F. P. O'Boyle,et al. Integrating Loop and Data Transformations for Global Optimization , 2002, J. Parallel Distributed Comput.
[25] Mahmut T. Kandemir,et al. A Matrix-Based Approach to Global Locality Optimization , 1999, J. Parallel Distributed Comput.
[26] Mahmut T. Kandemir,et al. Quantifying Data Locality in Dynamic Parallelism in GPUs , 2018, Proc. ACM Meas. Anal. Comput. Syst.
[27] Gurindar S. Sohi,et al. High-bandwidth data memory systems for superscalar processors , 1991, ASPLOS IV.
[28] Mahmut T. Kandemir,et al. Opportunistic Computing in GPU Architectures , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).
[29] Wei Li,et al. Compiling for NUMA Parallel Machines , 1993.
[30] Onur Mutlu,et al. A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[31] Mahmut T. Kandemir,et al. Improving bank-level parallelism for irregular applications , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[32] Brian Fahs,et al. Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.
[33] Stijn Eyerman,et al. A Memory-Level Parallelism Aware Fetch Policy for SMT Processors , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[34] Sarita V. Adve,et al. Code transformations to improve memory parallelism , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.
[35] Monica S. Lam,et al. Array-data flow analysis and its use in array privatization , 1993, POPL '93.
[36] Mor Harchol-Balter,et al. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers , 2010.
[37] M. Kandemir,et al. Computing with Near Data , 2019, SIGMETRICS.
[38] Mahmut T. Kandemir,et al. Compiler Support for Optimizing Memory Bank-Level Parallelism , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[39] Mateo Valero,et al. Static locality analysis for cache management , 1997, Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques.
[40] Monica S. Lam,et al. An affine partitioning algorithm to maximize parallelism and minimize communication , 1999, ICS '99.
[41] Mahmut T. Kandemir,et al. A Layout-Conscious Iteration Space Transformation Technique , 2001, IEEE Trans. Computers.
[42] Onur Mutlu,et al. Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, ISCA.
[43] Uday Bondhugula,et al. PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System , 2015.
[44] Chau-Wen Tseng,et al. Compiler optimizations for improving data locality , 1994, ASPLOS VI.
[45] Mahmut T. Kandemir,et al. DEMM: A Dynamic Energy-Saving Mechanism for Multicore Memories , 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).
[46] Vivien Quéma,et al. Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.
[47] Keshav Pingali,et al. Data-centric multi-level blocking , 1997, PLDI '97.
[48] Olivier Temam,et al. Data caches for superscalar processors , 1997, ICS '97.
[49] Yen-Chen Liu,et al. Knights Landing: Second-Generation Intel Xeon Phi Product , 2016, IEEE Micro.
[50] Mahmut T. Kandemir,et al. Enhancing computation-to-core assignment with physical location information , 2018, PLDI.
[51] Lieven Eeckhout,et al. Modeling Superscalar Processor Memory-Level Parallelism , 2018, IEEE Computer Architecture Letters.
[52] Mahmut T. Kandemir,et al. FLOSS: FLOw Sensitive Scheduling on Mobile Platforms , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).
[53] Mor Harchol-Balter,et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[54] Mahmut T. Kandemir,et al. Memory Row Reuse Distance and its Role in Optimizing Application Performance , 2015, SIGMETRICS 2015.
[55] Onur Mutlu,et al. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.
[56] Olivier Temam,et al. A quantitative analysis of loop nest locality , 1996, ASPLOS VII.
[57] Dam Sunwoo,et al. Balancing DRAM locality and parallelism in shared memory CMP systems , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[58] Anoop Gupta,et al. Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.