Alleviating Scalability Limitation of Accelerator-Based Platforms

Accelerator-based chip multiprocessors (ACMPs), which combine application-specific hardware accelerators (ACCs) with one or more host processor cores, are promising architectures for high-performance and power-efficient computing. However, ACMPs with many ACCs face scalability limitations. The ACCs' performance benefits can be overshadowed by bottlenecks on shared resources: the processor core(s), the communication fabric/DMA, and the on-chip memory. These bottlenecks are rooted primarily in the ACCs' data access and their dependency on the host for orchestration. Because ACC communication semantics are only loosely defined and ACCs are integrated through general-purpose architectures, the resource bottlenecks hamper performance. This paper explores and alleviates the scalability limitations of ACMPs. To this end, it first proposes ACMPerf, an analytical model that captures the impact of the resource bottlenecks on the achievable ACC benefits. It then identifies and formalizes ACC communication semantics, which pave the way toward a more scalable integration of ACCs. The semantics describe four primary aspects: 1) data access; 2) data granularity; 3) data marshalling; and 4) synchronization. Finally, the paper proposes a novel architecture of transparent self-synchronizing accelerators (TSS). TSS efficiently realizes the identified communication semantics for direct ACC-to-ACC connections, which often occur in streaming applications. TSS delivers more of the ACCs' benefits than conventional ACMP architectures. Given the same set of ACCs, TSS achieves up to $130\times$ higher throughput and $78\times$ lower energy consumption, mainly by reducing the load on shared architectural resources by $78.3\times$.
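
The bottleneck argument above can be illustrated with a minimal sketch. This is not ACMPerf, whose equations the abstract does not give; it is a generic bottleneck bound under assumed conditions: identical ACCs, a single shared communication fabric, and a fixed amount of data moved per job. The function name, parameters, and numbers are hypothetical.

```python
def system_throughput(num_accs, acc_rate, fabric_capacity, bytes_per_job):
    """Jobs per second delivered by num_accs identical ACCs sharing one fabric.

    Assumed, illustrative model (not ACMPerf): throughput is the smaller of
    the aggregate ACC compute rate and the rate the shared fabric can feed.
    """
    compute_bound = num_accs * acc_rate              # jobs/s if data were free
    fabric_bound = fabric_capacity / bytes_per_job   # jobs/s the fabric can carry
    return min(compute_bound, fabric_bound)


if __name__ == "__main__":
    # Hypothetical numbers: 100 jobs/s per ACC, 1 MB/s fabric, 2 kB per job.
    for n in (1, 2, 4, 8, 16):
        rate = system_throughput(n, acc_rate=100.0,
                                 fabric_capacity=1e6, bytes_per_job=2000.0)
        print(f"{n:2d} ACCs -> {rate:.0f} jobs/s")
```

Under these assumed numbers, throughput grows with the ACC count until the fabric saturates and then stays flat, which is the kind of scalability limit the abstract describes. Lowering the data moved over shared resources per job raises the fabric bound, consistent with the claim that TSS gains come mainly from reducing the load on shared resources.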
