Leveraging on Deep Memory Hierarchies to Minimize Energy Consumption and Data Access Latency on Single-Chip Cloud Computers

Recent advances in chip design and integration technologies have led to the development of Single-Chip Cloud Computers, which are a microcosm of cloud datacenters. These computers are based on Network-on-Chip (NoC) architectures with deep memory hierarchies. Developing scheduling algorithms that reduce both data access latency and energy consumption is a major challenge for such architectures. In this paper, we propose a set of algorithms that address task scheduling and data allocation jointly, in a unified approach. Moreover, we present a feasible system model for NoC-based multicores with a three-level memory hierarchy that captures the energy consumed by the main elements of the system, including the processing cores, the caches, and the NoC subsystem. Simulation results show that the proposed algorithms outperform two state-of-the-art algorithms from the literature. The experimental results indicate that algorithms that schedule data and tasks jointly are superior to techniques that handle task scheduling and data scheduling separately.
