论文信息 - Enhancing the Performance and Fairness of Shared DRAM Systems with Parallelism-Aware Batch Scheduling

Enhancing the Performance and Fairness of Shared DRAM Systems with Parallelism-Aware Batch Scheduling

Enhancing the Performance and Fairness of Shared DRAM Systems with Parallelism-Aware Batch Scheduling Onur Mutlu Thomas Moscibroda Microsoft Research Abstract In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads’ DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result both fairness and system throughput degrade, and some threads can starve for long time periods. This paper proposes a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities. We evaluate the design trade-offs involved in PAR-BS and compare it to four previously proposed DRAM scheduler designs on 4-, 8-, and 16-core systems. Our evaluations show that, averaged over 100 4-core workloads, PAR-BS improves fairness by 1.11X and system throughput by 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritization rules, PAR-BS is also simpler to implement than STFM.

Onur Mutlu | Thomas Moscibroda | O. Mutlu | T. Moscibroda

[1] Faye A. Briggs,et al. A study of performance impact of memory controller features in multi-processor server environment , 2004, WMPI '04.

[2] Yan Solihin,et al. QoS policies and architecture for cache/memory in CMP platforms , 2007, SIGMETRICS '07.

[3] Sanjay Bhansali,et al. Framework for instruction-level tracing and analysis of program executions , 2006, VEE '06.

[4] Won-Taek Lim,et al. Effective Management of DRAM Bandwidth in Multicore Processors , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[5] Onur Mutlu,et al. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[6] Trevor N. Mudge,et al. A performance comparison of contemporary DRAM architectures , 1999, ISCA.

[7] John W. Lockwood,et al. Beyond performance: secure and fair memory management for multiple systems on a chip , 2003, Proceedings. 2003 IEEE International Conference on Field-Programmable Technology (FPT) (IEEE Cat. No.03EX798).

[8] Onur Mutlu,et al. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems , 2007, USENIX Security Symposium.

[9] M TullsenDean,et al. Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000 .

[10] G. Edward Suh,et al. A new memory monitoring scheme for memory-aware scheduling and partitioning , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[11] Brian Fahs,et al. Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[12] William Jalby,et al. XOR-Schemes: A Flexible Data Organization in Parallel Memories , 1985, ICPP.

[13] James E. Smith,et al. Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[14] Rajiv Kapoor,et al. Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[15] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[16] Sally A. McKee,et al. Dynamic Access Ordering for Streamed Computations , 2000, IEEE Trans. Computers.

[17] E SmithJames,et al. Implementation of precise interrupts in pipelined processors , 1985 .

[18] Howard Frank,et al. Analysis and Optimization of Disk Storage Devices for Time-Sharing Systems , 1969, JACM.

[19] Toby J. Teorey,et al. A comparative analysis of disk scheduling policies , 1972, CACM.

[20] Calvin Lin,et al. Adaptive History-Based Memory Schedulers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[21] Wayne E. Smith. Various optimizers for single‐stage production , 1956 .

[22] John Wilkes,et al. Disk scheduling algorithms based on rotational position , 1991 .

[23] R. M. Tomasulo,et al. An efficient algorithm for exploiting multiple arithmetic units , 1995 .

[24] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[25] Jun Shao,et al. A Burst Scheduling Access Reordering Mechanism , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[26] Zhao Zhang,et al. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[27] Hsien-Hsin S. Lee,et al. Analyzing Performance Vulnerability due to Resource Denial›of›Service Attack on Chip Multiprocessors , 2007 .

[28] Zhao Zhang,et al. A performance comparison of DRAM memory system optimizations for SMT processors , 2005, 11th International Symposium on High-Performance Computer Architecture.

[29] Yan Solihin,et al. Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[30] Tejas Karkhanis,et al. A Day in the Life of a Data Cache Miss , 2002 .

[31] Onur Mutlu,et al. A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[32] Manoj Franklin,et al. Balancing thoughput and fairness in SMT processors , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[33] Trevor Mudge,et al. Modern dram architectures , 2001 .

[34] Trevor Mudge,et al. Improving data cache performance by pre-executing instructions under a cache miss , 1997 .

[35] Onur Mutlu,et al. Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance , 2006, IEEE Micro.

[36] Andrew R. Pleszkun,et al. Implementation of precise interrupts in pipelined processors , 1985, ISCA '98.

[37] Avi Mendelson,et al. Fairness and Throughput in Switch on Event Multithreading , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[38] Toby J. Teorey,et al. A comparative analysis of disk scheduling policies , 1972, CACM.

[39] Chein-Wei Jen,et al. Quality-aware memory controller for multimedia platform SoC , 2003, 2003 IEEE Workshop on Signal Processing Systems (IEEE Cat. No.03TH8682).

[40] Scott Rixner,et al. Memory Controller Optimizations for Web Servers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[41] Onur Mutlu,et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[42] Dean M. Tullsen,et al. Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.