Pushing the limits of accelerator efficiency while retaining programmability

The waning benefits of device scaling have driven a push toward domain-specific accelerators (DSAs), which sacrifice programmability for efficiency. While DSAs provide large benefits, they are prone to obsolescence as domains evolve, incur recurring design and verification costs, and impose large area footprints when a single device must integrate multiple DSAs. Motivated by the benefits of generality, this work explores how far a programmable architecture can be pushed, and whether it can approach the performance, energy, and area efficiency of a DSA-based approach. Our insight is that DSAs employ common specialization principles for concurrency, computation, communication, data reuse, and coordination, and that these same principles can be exploited in a programmable architecture through a composition of known microarchitectural mechanisms. Specifically, we propose and study an architecture called LSSD, composed of many tiny, low-power cores, each with a configurable spatial architecture, scratchpads, and DMA. Our results show that a programmable, specialized architecture can indeed be competitive with a domain-specific approach. Compared to four prominent and diverse DSAs, LSSD matches the DSAs' 10× to 150× speedup over an out-of-order core, with at most 4× more area and power than a single DSA, while retaining programmability.
