COSMOS

Hardware accelerators are key to the efficiency and performance of system-on-chip (SoC) architectures. With high-level synthesis (HLS), designers can easily obtain several implementations of each component of a complex hardware accelerator, each with a different performance-cost trade-off. However, navigating this design space in search of the Pareto-optimal implementations at the system level is a hard optimization task. We present COSMOS, an automatic methodology for the design-space exploration (DSE) of complex accelerators that coordinates both HLS and memory optimization tools in a compositional way. First, thanks to the co-design of datapath and memory, COSMOS produces a large set of Pareto-optimal implementations for each component of the accelerator. Then, COSMOS leverages compositional design techniques to quickly converge to the desired trade-off point between cost and performance at the system level. When applied to the system-level design (SLD) of an accelerator for wide-area motion imagery (WAMI), COSMOS explores the design space as completely as an exhaustive search, but it reduces the number of invocations of the HLS tool by up to 14.6×.
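To make the notion of a Pareto-optimal implementation concrete, the following is a minimal sketch of how a set of candidate implementations of one component could be filtered down to its Pareto front. It is an illustration under stated assumptions, not the COSMOS algorithm: the function name `pareto_front`, the choice of latency and area as the two metrics, and the sample values are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical metric pair: (latency, area). Both are to be minimized.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    A point dominates another if it is no worse in both metrics
    and strictly better in at least one.
    """
    front: List[Point] = []
    best_area = float("inf")
    # Sort by latency (then area) and sweep: a point is kept only if it
    # has strictly smaller area than every point with lower-or-equal latency.
    for latency, area in sorted(set(points)):
        if area < best_area:
            front.append((latency, area))
            best_area = area
    return front


# Made-up candidate implementations of a single component.
candidates = [(10, 5), (8, 7), (12, 4), (9, 8), (10, 6)]
print(pareto_front(candidates))
```

In this sketch, `(9, 8)` is discarded because `(8, 7)` is both faster and smaller, and `(10, 6)` is discarded because `(10, 5)` matches its latency with less area.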
