论文信息 - Towards achieving predictable memory performance on multi-core based mixed criticality embedded systems

Towards achieving predictable memory performance on multi-core based mixed criticality embedded systems

The reduced space, weight and power(SWaP) characteristics of multi-core systems has motivated the real-time systems research community to explore them in mixedcriticality embedded systems. However, the major challenge in mixed-criticality embedded systems is to provide determinism to real-time tasks in the presence of corunning non-real-time tasks. The shared resources such as Memory subsystem (caches and DRAM) and, Bus make this a challenging effort. This work focuses on the shared Memory subsystem. First, we studied Commercial-Off-The-Shelf (COTS) DRAM controllers to an extend to demonstrate the interference due to shared resources inside DRAM Controllers, its impact on predictability and, proposed a DRAM controller design, called MEDUSA, to provide predictable memory performance in multicore based real-time systems. MEDUSA can provide high time predictability when needed for real-time tasks but also strive to provide high average performance for non-real-time tasks through a close collaboration between the OS and the DRAM controller. In our approach, the OS partially partitions DRAM banks into two groups: reserved banks and shared banks. The reserved banks are exclusive to each core to provide predictable timing while the shared banks are shared by all cores to efficiently utilize the resources. MEDUSA has two separate queues for read and write requests, and it prioritizes reads over writes. In processing read requests, MEDUSA employs a two-level scheduling algorithm that prioritizes the memory requests to the reserved banks in a Round Robin fashion to provide strong timing predictability. In processing write requests, MEDUSA largely relies on the FR-FCFS (First Ready-First Come First Serve) for high throughput but i makes an immediate switch to read upon arrival of read requests to the reserved banks. We implemented MEDUSA in a cycle-accurate full-system simulator and a Linux kernel and performed experiments using a set of synthetic and SPEC2006 benchmarks to analyze the performance impact of MEDUSA on both real-time and non-real-time tasks. The results show that MEDUSA achieves up to 91% better worst-case performance for real-time tasks while achieving up to 29% throughput improvement for non-real-time tasks. Second, we studied the contention at shared caches and its impact on predictability. We demonstrate that the prevailing cache partition techniques do not necessarily ensure predictable cache performance in modern COTS multicore platforms that use non-blocking caches to exploit memory-level-parallelism (MLP).Through carefully designed experiments using three real COTS multicore platforms (four distinct CPU architectures) and a cycle-accurate full system simulator, we show that special hardware registers in non-blocking caches, known as Miss Status Holding Registers (MSHRs), which track the status of outstanding cache-misses, can be a significant source of contention; we observe up to 21X WCET (Worst-Case-Execution-Time) increase in a real COTS multicore platform due to MSHR contention. We propose a hardware and system software (OS) collaborative approach to efficiently eliminate MSHR contention for multicore real-time systems. Our approach includes a low-cost hardware extension that enables dynamic control of per-core MLP by the OS. Using the hardware extension, the OS scheduler then globally controls each core’s MLP in such a way that eliminates MSHR contention and maximizes overall throughput of the system. We implement the hardware extension in a cycle-accurate full-system simulator and the scheduler modification in Linux 3.14 kernel. We evaluate the effectiveness of our approach using a set of synthetic and macro benchmarks. In a case study, we achieve up to 19% WCET reduction (average: 13%) for a set of EEMBC benchmarks ii compared to a baseline cache partitioning setup.

Prathap Kumar Valsan

[1] Heechul Yun. Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms , 2015 .

[2] Magnus Jahre,et al. A light-weight fairness mechanism for chip multiprocessor memory systems , 2009, CF '09.

[3] Josep Torrellas,et al. Scalable Cache Miss Handling for High Memory-Level Parallelism , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[4] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[5] Ronald G. Dreslinski,et al. Sources of error in full-system simulation , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[6] Kees G. W. Goossens,et al. Real-Time Scheduling Using Credit-Controlled Static-Priority Arbitration , 2008, 2008 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications.

[7] Tarek F. Abdelzaher,et al. Delay composition in preemptive and non-preemptive real-time pipelines , 2008, Real-Time Systems.

[8] Heechul Yun,et al. Taming Non-Blocking Caches to Improve Isolation in Multicore Real-Time Systems , 2016, 2016 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).

[9] Heechul Yun,et al. MEDUSA: A Predictable and High-Performance DRAM Controller for Multicore Based Embedded Systems , 2015, 2015 IEEE 3rd International Conference on Cyber-Physical Systems, Networks, and Applications.

[10] Somayeh Sardashti,et al. The gem5 simulator , 2011, CARN.

[11] Björn Andersson,et al. Bounding memory interference delay in COTS-based multi-core systems , 2014, 2014 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS).

[12] Magnus Jahre,et al. A High Performance Adaptive Miss Handling Architecture for Chip Multiprocessors , 2011, Trans. High Perform. Embed. Archit. Compil..

[13] Wang Yi,et al. Building timing predictable embedded systems , 2014, ACM Trans. Embed. Comput. Syst..

[14] David Broman,et al. A predictable and command-level priority-based DRAM controller for mixed-criticality systems , 2015, 21st IEEE Real-Time and Embedded Technology and Applications Symposium.

[15] Norman P. Jouppi,et al. Staged Reads: Mitigating the impact of DRAM writes on DRAM reads , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[16] Ying Ye,et al. COLORIS: A dynamic cache partitioning system using page coloring , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[17] Zhao Zhang,et al. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[18] Rodolfo Pellizzoni,et al. PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms , 2014, 2014 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS).

[19] Lothar Thiele,et al. Timing Analysis for TDMA Arbitration in Resource Sharing Systems , 2010, 2010 16th IEEE Real-Time and Embedded Technology and Applications Symposium.

[20] Lothar Thiele,et al. Timing Analysis for Resource Access Interference on Adaptive Resource Arbiters , 2011, 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium.

[21] David Eklov,et al. Bandwidth Bandit: Quantitative characterization of memory contention , 2012, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[22] James H. Anderson,et al. Outstanding Paper Award: Making Shared Caches More Predictable on Multicore Platforms , 2013, 2013 25th Euromicro Conference on Real-Time Systems.

[23] Robert I. Davis,et al. Mixed Criticality Systems - A Review , 2015 .

[24] Serge J. Belongie,et al. SD-VBS: The San Diego Vision Benchmark Suite , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[25] Dam Sunwoo,et al. Balancing DRAM locality and parallelism in shared memory CMP systems , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[26] Frank Mueller,et al. Providing task isolation via TLB coloring , 2015, 21st IEEE Real-Time and Embedded Technology and Applications Symposium.

[27] Rolf Ernst,et al. Improved DRAM Timing Bounds for Real-Time DRAM Controllers with Read/Write Bundling , 2015, 2015 IEEE Real-Time Systems Symposium.

[28] Edward A. Lee,et al. PRET DRAM controller: Bank privatization for predictability and temporal isolation , 2011, 2011 Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[29] Onur Mutlu,et al. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems , 2007, USENIX Security Symposium.

[30] James H. Anderson,et al. Cache Sharing and Isolation Tradeoffs in Multicore Mixed-Criticality Systems , 2015, 2015 IEEE Real-Time Systems Symposium.

[31] Thomas F. Wenisch,et al. Simulating DRAM controllers for future system architecture exploration , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[32] Selma Saidi,et al. A mixed critical memory controller using bank privatization and fixed priority scheduling , 2014, 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications.

[33] Francisco J. Cazorla,et al. An Analyzable Memory Controller for Hard Real-Time CMPs , 2009, IEEE Embedded Systems Letters.

[34] Lui Sha,et al. MemGuard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms , 2013, 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS).

[35] Björn Andersson,et al. Coordinated Bank and Cache Coloring for Temporal Protection of Memory Accesses , 2013, 2013 IEEE 16th International Conference on Computational Science and Engineering.

[36] Rodolfo Pellizzoni,et al. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems , 2014, 2015 27th Euromicro Conference on Real-Time Systems.

[37] Ragunathan Rajkumar,et al. A Coordinated Approach for Practical OS-Level Cache Management in Multi-core Real-Time Systems , 2013, 2013 25th Euromicro Conference on Real-Time Systems.

[38] O. Mutlu,et al. Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS XV.

[39] G. Edward Suh,et al. A new memory monitoring scheme for memory-aware scheduling and partitioning , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[40] Michael Stumm,et al. Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[41] Lei Liu,et al. A software memory partition approach for eliminating bank-level interference in multicore systems , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[42] Marco Caccamo,et al. Real-time cache management framework for multi-core architectures , 2013, 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS).

[43] Rodolfo Pellizzoni,et al. Worst Case Analysis of DRAM Latency in Multi-requestor Systems , 2013, 2013 IEEE 34th Real-Time Systems Symposium.

[44] S. Vestal. Preemptive Scheduling of Multi-criticality Systems with Varying Degrees of Execution Time Assurance , 2007, RTSS 2007.

[45] Francisco J. Cazorla,et al. A Dual-Criticality Memory Controller (DCmc): Proposal and Evaluation of a Space Case Study , 2014, 2014 IEEE Real-Time Systems Symposium.