ALPHA: A Learning-Enabled High-Performance Network-on-Chip Router Design for Heterogeneous Manycore Architectures

Heterogeneous manycores comprising CPUs, GPUs, and accelerators place stringent demands on networks-on-chip (NoCs). The NoC must support the combined traffic, which includes both latency-sensitive CPU traffic and throughput-sensitive GPU and accelerator traffic. We study the characteristics of this combined traffic and observe that (1) limited injection bandwidth is the main obstacle to throughput improvement, and (2) latency due to local and global contention accounts for a significant portion of the network latency. We propose ALPHA, a router architecture for heterogeneous manycores. ALPHA introduces two optimizations: (1) increasing injection bandwidth to improve throughput, and (2) resolving local and global contention to reduce network latency. Specifically, ALPHA increases injection bandwidth by modifying the injection link, the crossbar switch, and the buffer organization in the injection port of the router. It identifies upcoming local contention and resolves it by optimally selecting traffic routes. It detects and alleviates global contention with a supervised learning engine that performs traffic analysis, prediction, and adjustment. Simulation results using the Rodinia benchmark suite show that, compared to the baseline router, ALPHA provides a 28% throughput increase, a 24% latency reduction, a 22% execution-time speedup, and a 19% energy-efficiency improvement.
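
The abstract does not say how ALPHA selects routes when it anticipates local contention. As a rough illustration of the general idea only, the sketch below picks, among the minimal-path output ports of a 2D mesh router, the one whose downstream buffer is least occupied. The mesh topology, the port names, the `occupancy` map, and the functions `productive_ports` and `select_port` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    """Router position on a 2D mesh (an assumed topology, not from the paper)."""
    x: int
    y: int


def productive_ports(cur: Coord, dst: Coord) -> list[str]:
    """Return the output ports that move a packet one hop closer to dst."""
    ports = []
    if dst.x > cur.x:
        ports.append("E")
    elif dst.x < cur.x:
        ports.append("W")
    if dst.y > cur.y:
        ports.append("N")
    elif dst.y < cur.y:
        ports.append("S")
    # At the destination router the packet is ejected to the local port.
    return ports or ["LOCAL"]


def select_port(cur: Coord, dst: Coord, occupancy: dict[str, int]) -> str:
    """Pick the productive port with the lowest downstream buffer occupancy.

    `occupancy` is a hypothetical per-port count of occupied downstream
    buffer slots; ports missing from the map are treated as empty.
    Ties go to the first candidate, which is the X-dimension port.
    """
    candidates = productive_ports(cur, dst)
    return min(candidates, key=lambda port: occupancy.get(port, 0))


if __name__ == "__main__":
    here, there = Coord(1, 1), Coord(3, 3)
    # East is congested, so the packet is steered north instead.
    print(select_port(here, there, {"E": 4, "N": 1}))
```

This sketch covers only the port-selection step. A real adaptive router would also need a deadlock-avoidance scheme, such as turn restrictions or escape virtual channels, which is omitted here.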
