论文信息 - Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems

Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems

This paper summarizes the ideas and key concepts in MISE (Memory Interference-induced Slowdown Estimation), which was published in HPCA 2013 [97], and examines the work's significance and future potential. Applications running concurrently on a multicore system interfere with each other at the main memory. This interference can slow down different applications differently. Accurately estimating the slowdown of each application in such a system can enable mechanisms that can enforce quality-of-service. While much prior work has focused on mitigating the performance degradation due to inter-application interference, there is little work on accurately estimating slowdown of individual applications in a multi-programmed environment. Our goal is to accurately estimate application slowdowns, towards providing predictable performance. To this end, we first build a simple Memory Interference-induced Slowdown Estimation (MISE) model, which accurately estimates slowdowns caused by memory interference. We then leverage our MISE model to develop two new memory scheduling schemes: 1) one that provides soft quality-of-service guarantees, and 2) another that explicitly attempts to minimize maximum slowdown (i.e., unfairness) in the system. Evaluations show that our techniques perform significantly better than state-of-the-art memory scheduling approaches to address the same problems. Our proposed model and techniques have enabled significant research in the development of accurate performance models [35, 59, 98, 110] and interference management mechanisms [66, 99, 100, 108, 119, 120].

[1] William E. Weihl,et al. Lottery scheduling: flexible proportional-share resource management , 1994, OSDI '94.

[2] Garth A. Gibson,et al. Implementing Lottery Scheduling: Matching the Specializations in Traditional Schedulers , 1999, USENIX Annual Technical Conference, General Track.

[3] A. Snavely,et al. Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[4] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[5] Manoj Franklin,et al. Balancing thoughput and fairness in SMT processors , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[6] Mithuna Thottethodi,et al. Self-tuned congestion control for multiprocessor networks , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[7] James E. Smith,et al. A first-order superscalar processor model , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[8] Ravi R. Iyer,et al. CQoS: a framework for enabling QoS in shared caches of CMP platforms , 2004, ICS '04.

[9] Pedro López,et al. A family of mechanisms for congestion control in wormhole networks , 2005, IEEE Transactions on Parallel and Distributed Systems.

[10] James E. Smith,et al. Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[11] James E. Smith,et al. A performance counter architecture for computing accurate CPI components , 2006, ASPLOS XII.

[12] Francisco J. Cazorla,et al. Predictable performance in SMT processors: synergy between the OS and SMTs , 2006, IEEE Transactions on Computers.

[13] Onur Mutlu,et al. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems , 2007, USENIX Security Symposium.

[14] Onur Mutlu,et al. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[15] Yan Solihin,et al. QoS policies and architecture for cache/memory in CMP platforms , 2007, SIGMETRICS '07.

[16] Onur Mutlu,et al. Prefetch-Aware DRAM Controllers , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[17] Onur Mutlu,et al. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[18] M. Breitwisch. Phase Change Memory , 2008, 2008 International Interconnect Technology Conference.

[19] Stijn Eyerman,et al. System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[20] Onur Mutlu,et al. Distributed order scheduling and its application to multi-core dram controllers , 2008, PODC '08.

[21] Onur Mutlu,et al. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach , 2008, 2008 International Symposium on Computer Architecture.

[22] Ramesh Illikkal,et al. Rate-based QoS techniques for cache/memory in CMP platforms , 2009, ICS.

[23] Tao Li,et al. Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[24] Onur Mutlu,et al. Improving memory Bank-Level Parallelism in the presence of prefetching , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25] YangJun,et al. A durable and energy efficient main memory using phase change memory technology , 2009 .

[26] James E. Smith,et al. Advanced Micro Devices , 2005 .

[27] Onur Mutlu,et al. Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[28] Stijn Eyerman,et al. Per-thread cycle accounting in SMT processors , 2009, ASPLOS.

[29] EeckhoutLieven,et al. Per-thread cycle accounting in SMT processors , 2009 .

[30] Vijayalakshmi Srinivasan,et al. Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[31] Francisco J. Cazorla,et al. CPU Accounting in CMP Processors , 2009, IEEE Computer Architecture Letters.

[32] Chita R. Das,et al. Application-aware prioritization mechanisms for on-chip networks , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[33] Tor M. Aamodt,et al. Complexity effective memory access scheduling for many-core accelerator architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[34] Onur Mutlu,et al. Phase change memory architecture and the quest for scalability , 2010, Commun. ACM.

[35] Jun Yang,et al. Phase-Change Technology and the Future of Main Memory , 2010, IEEE Micro.

[36] Onur Mutlu,et al. DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems , 2010 .

[37] Mor Harchol-Balter,et al. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[38] Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS XV.

[39] Mor Harchol-Balter,et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[40] Alexandra Fedorova,et al. Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.

[41] Stijn Eyerman,et al. Per-Thread Cycle Accounting , 2010, IEEE Micro.

[42] Chris Fallin,et al. Next generation on-chip networks: what kind of congestion control do we need? , 2010, Hotnets-IX.

[43] Onur Mutlu,et al. Prefetch-Aware Memory Controllers , 2011, IEEE Transactions on Computers.

[44] Sai Prashanth Muralidhara,et al. Reducing memory interference in multicore systems via application-aware memory channel partitioning , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[45] R. Balasubramonian,et al. Refining the Utility Metric for Utility-Based Cache Partitioning ∗ , 2011 .

[46] Lingjia Tang,et al. The impact of memory subsystem resource sharing on datacenter applications , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[47] S. Phadke,et al. MLP aware heterogeneous memory system , 2011, 2011 Design, Automation & Test in Europe.

[48] Kevin Skadron,et al. Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[49] Chris Fallin,et al. Parallel application memory scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[50] Ricardo Bianchini,et al. Page placement in hybrid memory systems , 2011, ICS '11.

[51] David Eklov,et al. Cache Pirating: Measuring the Curse of the Shared Cache , 2011, 2011 International Conference on Parallel Processing.

[52] Onur Mutlu,et al. Prefetch-aware shared-resource management for multi-core systems , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[53] Lizy Kurian John,et al. Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[54] Kevin Kai-Wei Chang,et al. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[55] Lieven Eeckhout,et al. Scheduling heterogeneous multi-cores through performance impact estimation (PIE) , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[56] Srinivasan Seshan,et al. On-chip networks from a networking perspective: congestion and scalability in many-core interconnects , 2012, SIGCOMM '12.

[57] Richard Veras,et al. RAIDR: Retention-aware intelligent DRAM refresh , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[58] SeshanSrinivasan,et al. On-chip networks from a networking perspective , 2012 .

[59] Zhen Fang,et al. Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[60] Onur Mutlu,et al. Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management , 2012, IEEE Computer Architecture Letters.

[61] Kevin Kai-Wei Chang,et al. HAT: Heterogeneous Adaptive Throttling for On-Chip Networks , 2012, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing.

[62] Dam Sunwoo,et al. Balancing DRAM locality and parallelism in shared memory CMP systems , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[63] Lei Liu,et al. A software memory partition approach for eliminating bank-level interference in multicore systems , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[64] Onur Mutlu,et al. The evicted-address filter: A unified mechanism to address both cache pollution and thrashing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[65] Rachata Ausavarungnirun,et al. Row buffer locality aware caching policies for hybrid memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[66] Onur Mutlu,et al. A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[67] Onur Mutlu,et al. MISE: Providing performance predictability and improving fairness in shared main memory systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[68] Stijn Eyerman,et al. Per-thread cycle accounting in multicore processors , 2013, TACO.

[69] David Eklov,et al. Bandwidth Bandit: Quantitative characterization of memory contention , 2012, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[70] Reetuparna Das,et al. Application-to-core mapping policies to reduce memory system interference in multi-core systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[71] Onur Mutlu,et al. A Case for Effic ient Hardware/Soft ware Cooperative Management of Storage and Memory , 2013 .

[72] Onur Mutlu,et al. An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms , 2013, ISCA.

[73] David Black-Schaffer,et al. Modeling performance variation due to cache sharing , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[74] Onur Mutlu,et al. Tiered-latency DRAM: A low latency and low cost DRAM architecture , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[75] Lingjia Tang,et al. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[76] José F. Martínez,et al. Improving memory scheduling via processor-side load criticality information , 2013, ISCA.

[77] Rachata Ausavarungnirun,et al. RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[78] Onur Mutlu,et al. Memory scaling: A systems architecture perspective , 2013, 2013 5th IEEE International Memory Workshop.

[79] Mahmut T. Kandemir,et al. Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[80] Onur Mutlu,et al. Improving DRAM performance by parallelizing refreshes with accesses , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[81] Mahmut T. Kandemir,et al. Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[82] Onur Mutlu,et al. The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[83] Onur Mutlu,et al. FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[84] Chris Fallin,et al. Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[85] Onur Mutlu,et al. Research Problems and Opportunities in Memory Systems , 2014, Supercomput. Front. Innov..

[86] Xu Cheng,et al. Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[87] Pavan Balaji,et al. Toward the efficient use of multiple explicitly managed memory subsystems , 2014, 2014 IEEE International Conference on Cluster Computing (CLUSTER).

[88] Onur Mutlu,et al. Adaptive-latency DRAM: Optimizing DRAM timing for the common-case , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[89] Onur Mutlu,et al. The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[90] Jongmoo Choi,et al. Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[91] Hui Wang,et al. A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters , 2015, VEE 2015.

[92] Karsten Schwan,et al. Data tiering in heterogeneous memory systems , 2016, EuroSys.

[93] Onur Mutlu,et al. Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[94] Rami G. Melhem,et al. Concurrent Migration of Multiple Pages in software-managed hybrid main memory , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[95] Onur Mutlu,et al. ChargeCache: Reducing DRAM latency by exploiting row access locality , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[96] Onur Mutlu,et al. BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling , 2016, IEEE Transactions on Parallel and Distributed Systems.

[97] Keke Gai,et al. Smart Energy-Aware Data Allocation for Heterogeneous Memory , 2016, 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[98] Kevin Kai-Wei Chang,et al. DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators , 2016, ACM Trans. Archit. Code Optim..

[99] Onur Mutlu,et al. Simultaneous Multi-Layer Access , 2016, ACM Trans. Archit. Code Optim..

[100] David Wentzlaff,et al. MITTS: Memory Inter-arrival Time Traffic Shaping , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[101] Yuan Yao,et al. Aggregate Flow-Based Performance Fairness in CMPs , 2016, ACM Trans. Archit. Code Optim..

[102] Lei Liu,et al. Memos: A full hierarchy hybrid memory management framework , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[103] Onur Mutlu,et al. Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization , 2016, SIGMETRICS.

[104] Onur Mutlu,et al. Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.

[105] Onur Mutlu,et al. SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[106] Onur Mutlu,et al. The reach profiler (REAPER): Enabling the mitigation of DRAM retention failures via profiling at aggressive conditions , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[107] Onur Mutlu,et al. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[108] Rachata Ausavarungnirun,et al. Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms , 2017, SIGMETRICS.

[109] Srinivas Devadas,et al. Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[110] Onur Mutlu,et al. Carpool: a bufferless on-chip network supporting adaptive multicast and hotspot alleviation , 2017, ICS.

[111] Jin Sun,et al. Utility-Based Hybrid Memory Management , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).

[112] Xiaolang Yan,et al. Providing Predictable Performance via a Slowdown Estimation Model , 2017, ACM Trans. Archit. Code Optim..

[113] Onur Mutlu,et al. Understanding Reduced-Voltage Operation in Modern DRAM Devices , 2017, Proc. ACM Meas. Anal. Comput. Syst..

[114] Thomas F. Wenisch,et al. Thermostat: Application-transparent Page Management for Two-tiered Main Memory , 2017, ASPLOS.

[115] Lieven Eeckhout,et al. GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).