Co-optimizing memory-level parallelism and cache-level parallelism

Minimizing cache misses has traditionally been the primary goal of compiler-based cache optimization. However, continuously growing dataset sizes, combined with the large numbers of cache banks and memory banks connected through on-chip networks in emerging manycores and accelerators, make optimizing the latencies of cache hits and misses as important as minimizing the cache miss rate. In this paper, we propose compiler support that optimizes both the latencies of last-level cache (LLC) hits and the latencies of LLC misses. Our approach pursues this goal by improving the parallelism exhibited by LLC hits and LLC misses; more specifically, it tries to maximize both cache-level parallelism (CLP) and memory-level parallelism (MLP). This paper presents different incarnations of our approach and evaluates them using a set of 12 multithreaded applications. Our results indicate that (i) optimizing MLP first and CLP later brings, on average, 11.31% performance improvement over an approach that already minimizes the number of LLC misses, and (ii) optimizing CLP first and MLP later brings 9.43% performance improvement. In comparison, balancing MLP and CLP brings 17.32% performance improvement on average.
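To make the two metrics concrete, the sketch below (a minimal illustration, not the paper's compiler algorithm) quantifies CLP and MLP for a window of concurrent requests, assuming a simple address-interleaved mapping of cache lines to LLC banks and memory banks; the line size, bank counts, and helper names are hypothetical placeholders. Informally, CLP counts how many distinct LLC banks are kept busy by concurrent LLC hits, and MLP counts how many distinct memory banks are kept busy by concurrent LLC misses; a CLP/MLP-aware transformation restructures the access stream so that concurrently issued requests spread across banks instead of piling onto one.

# Minimal illustrative sketch (assumed mapping and constants, not the paper's method).
LINE_SIZE = 64       # bytes per cache line (assumed)
NUM_LLC_BANKS = 16   # number of LLC banks (assumed)
NUM_MEM_BANKS = 8    # number of memory banks (assumed)

def llc_bank(addr: int) -> int:
    # Cache lines interleaved round-robin across LLC banks.
    return (addr // LINE_SIZE) % NUM_LLC_BANKS

def mem_bank(addr: int) -> int:
    # Cache lines interleaved round-robin across memory banks.
    return (addr // LINE_SIZE) % NUM_MEM_BANKS

def window_parallelism(hit_addrs, miss_addrs):
    # CLP: distinct LLC banks touched by concurrent hits.
    # MLP: distinct memory banks touched by concurrent misses.
    clp = len({llc_bank(a) for a in hit_addrs})
    mlp = len({mem_bank(a) for a in miss_addrs})
    return clp, mlp

# Four concurrent requests: a stride-1024 layout maps them all to bank 0,
# while consecutive cache lines spread across four different banks.
serialized = [0, 1024, 2048, 3072]
spread     = [0, 64, 128, 192]
print(window_parallelism(serialized, serialized))  # (1, 1): low CLP and MLP
print(window_parallelism(spread, spread))          # (4, 4): higher CLP and MLP

Under these assumptions, the two windows issue the same number of requests but differ sharply in how many banks can service them in parallel, which is exactly the gap the proposed compiler support targets.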
