Optimal Application Mapping and Scheduling for Network-on-Chips with Computation in STT-RAM Based Router

Spin-Transfer Torque Magnetic RAM (STT-RAM), one of the emerging nonvolatile memory (NVM) technologies explored as a replacement for SRAM-based memory architectures, is particularly promising because of its fast access speed, high integration density, and zero standby power consumption. Recently, hybrid designs with SRAM and STT-RAM buffers for routers in Network-on-Chip (NoC) systems have been widely implemented to exploit the complementary characteristics of the two memory technologies and to improve intra-router latency and system power consumption. Building on the Processing-in-Memory capability enabled by STT-RAM, in this paper we offload execution from processors to the STT-RAM based on-chip routers to improve application performance. On top of the hybrid buffer design in routers, we further present system-level approaches, including an ILP model and polynomial-time heuristic algorithms, to fine-tune application mapping and scheduling on NoCs, with the objective of improving both system performance and energy efficiency. The network overhead caused by flit conflicts in conventional communication can ideally be avoided by computing the contended flits in intermediate routers; meanwhile, the pressure of heavy workloads on processors can be relieved by transferring some operations to routers, so that network latency and system power consumption can be significantly reduced. Experimental results on PARSEC benchmark applications demonstrate that application schedule length and system energy consumption can be reduced by 35.62 and 32.87 percent on average, respectively. In particular, the improvements in application performance and energy efficiency for the CNN application AlexNet, averaging 36.44 and 33.19 percent respectively, verify the practicability and effectiveness of the presented approaches.
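To make the offloading idea concrete, the following is a minimal Python sketch of a greedy earliest-finish-time scheduler in which some tasks may run either on a processor or on a compute-capable router. It is not the ILP model or the heuristic algorithms of the paper, whose details the abstract does not give; the `Task` fields, the function name `greedy_schedule`, and all timing values are hypothetical, and communication delay is ignored for brevity.

```python
from collections import namedtuple

# Hypothetical task model: router_time is None when a task cannot be offloaded.
Task = namedtuple("Task", "name cpu_time router_time preds")

def greedy_schedule(tasks, n_cpus, n_routers):
    """Place each task (given in topological order) on the unit that
    finishes it earliest. Returns (placement, schedule_length)."""
    # Time at which each execution unit becomes free.
    free = {("cpu", i): 0 for i in range(n_cpus)}
    free.update({("router", i): 0 for i in range(n_routers)})
    finish, placement = {}, {}
    for t in tasks:
        # A task is ready once all of its predecessors have finished.
        ready = max((finish[p] for p in t.preds), default=0)
        best = None
        for unit, avail in free.items():
            cost = t.cpu_time if unit[0] == "cpu" else t.router_time
            if cost is None:
                continue  # task is not offloadable to a router
            end = max(ready, avail) + cost
            if best is None or end < best[0]:
                best = (end, unit)
        end, unit = best
        free[unit] = end
        finish[t.name] = end
        placement[t.name] = unit
    return placement, max(finish.values())

# Illustrative (made-up) workload: "b" and "c" may execute in a router.
tasks = [
    Task("a", 4, None, []),
    Task("b", 4, 6, []),
    Task("c", 2, 3, ["a", "b"]),
]
with_offload = greedy_schedule(tasks, n_cpus=1, n_routers=1)
without_offload = greedy_schedule(tasks, n_cpus=1, n_routers=0)
```

In this toy instance, running `b` in the router lets it overlap with `a` on the single processor, so the schedule length falls from 10 to 8 time units even though the router is assumed slower per task. A realistic model would also account for flit contention, link latency, and energy.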
