Taming Extreme Heterogeneity via Machine Learning based Design of Autonomous Manycore Systems

To avoid rewriting software for new computer architectures while exploiting extremely heterogeneous processing, communication, and storage technologies, we urgently need to determine the right amount and type of specialization while keeping a heterogeneous system as programmable and flexible as possible. To enable both programmability and flexibility in the heterogeneous computing era, we propose a novel complex-network-inspired model of computation, together with efficient optimization algorithms that determine the optimal degree of parallelization of legacy software. This mathematical framework lets us determine the required number and type of processing elements, the amount and organization of the deep memory hierarchy, and the degree of reconfiguration of the communication infrastructure, thus opening new avenues to performance and energy efficiency. It also enables heterogeneous manycore systems to adapt autonomously from traditional switching techniques to network coding strategies in order to sustain on-chip communication on the order of terabytes. While this new programming model enables the design of self-programmable autonomous heterogeneous manycore systems, a number of open challenges remain, which we discuss.
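The idea of extracting a degree of parallelization from legacy code via a complex-network model can be illustrated with a minimal sketch: represent tasks as nodes of a communication graph and fuse heavily communicating tasks onto the same processing element. The task names, edge weights, and threshold below are purely illustrative assumptions, not taken from the paper, and the greedy union-find clustering stands in for the paper's actual optimization algorithms.

```python
from collections import defaultdict

# Hypothetical task-communication graph extracted from legacy code:
# nodes are tasks, edge weights are bytes exchanged between tasks.
edges = {
    ("load", "filter"): 900,
    ("filter", "reduce"): 850,
    ("load", "log"): 10,
    ("render", "encode"): 700,
}

def cluster_tasks(edges, threshold):
    """Union-find clustering: tasks that communicate heavily
    (weight >= threshold) are fused onto the same processing
    element; weakly coupled clusters can run in parallel."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for (a, b), w in edges.items():
        if w >= threshold:
            parent[find(a)] = find(b)      # fuse the two tasks
        else:
            find(a); find(b)               # register tasks, keep apart
    clusters = defaultdict(set)
    for task in parent:
        clusters[find(task)].add(task)
    return list(clusters.values())

clusters = cluster_tasks(edges, threshold=500)
# Heavily communicating tasks fuse; the cluster count is a crude
# proxy for the exploitable degree of parallelization.
print(len(clusters))  # -> 3: {load,filter,reduce}, {log}, {render,encode}
```

The cluster count then suggests how many processing elements the workload can usefully occupy, while the weak edges between clusters indicate the communication the on-chip network must carry.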
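The switching-versus-network-coding adaptation mentioned above can be sketched with the classic butterfly pattern: two flits contending for one shared link are XOR-coded so that a single coded transmission serves both destinations. The flit values below are illustrative assumptions; this is a minimal sketch of the coding principle, not the paper's on-chip protocol.

```python
def xor_code(a: bytes, b: bytes) -> bytes:
    """Code (or decode) two equal-length flits by bitwise XOR."""
    return bytes(x ^ y for x, y in zip(a, b))

flit_a = b"\x12\x34"   # flit destined for node B
flit_b = b"\xab\xcd"   # flit destined for node A

# One coded transmission crosses the shared bottleneck link
# instead of two separate flits under plain switching.
coded = xor_code(flit_a, flit_b)

# Each destination already overheard the other flow's flit on a
# side link, so it decodes by XOR-ing the coded flit with it.
recovered_at_A = xor_code(coded, flit_a)  # yields flit_b
recovered_at_B = xor_code(coded, flit_b)  # yields flit_a

assert recovered_at_A == flit_b and recovered_at_B == flit_a
```

Under plain switching the bottleneck link carries both flits; with coding it carries one, halving the load on that link, which is the kind of gain an autonomous system can exploit when it detects congestion.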
