Specialized Macro-Instructions for Von-Neumann Accelerators

In the last few decades, Von-Neumann superscalar processors have been the dominant approach to improving general-purpose processing, with hardware specialization used as a complementary technique. However, the imminent end of Moore's law means that supply voltage and per-transistor switching power can no longer scale down at the pace Moore's law once predicted. As a result, there is renewed interest in hardware specialization to improve performance, power, and energy efficiency on specific tasks. Hardware customization has so far effectively targeted programs with abundant parallelism and simple data-access patterns using vector extensions, GPUs, and spatial fabrics. However, programs with diverse control and memory behavior have found it challenging to leverage these accelerators.

Our work is motivated by two main approaches to designing hardware accelerators. First, many proposals [45, 20] favor dataflow-based accelerators to reduce front-end energy. However, spatial fabrics suffer from poor fabric utilization and static power overheads when the available instruction-level parallelism falls below the fabric's peak operation parallelism [14]. Second, custom or "magic" instructions [19, 10, 23] integrated with the core pipeline reduce both front-end (fetch and decode) and back-end (register access) costs. The problem with this approach is its lack of generality across programs, because the desired set of custom instructions varies from program to program.

This dissertation proposes a Von-Neumann-based accelerator, Chainsaw, and demonstrates that many of the fundamental overheads (e.g., fetch and decode) can be amortized by adopting the appropriate instruction abstraction. To this end, we use the notion of chains: compiler-fused sequences of instructions. Chains convey the producer-consumer locality between dependent instructions, which the Chainsaw architecture exploits by temporally scheduling dependent operations on the same execution unit and using bypass registers to forward values between them. Chainsaw is a generic multi-lane architecture (with a 4-stage pipeline per lane) that requires no specialized compound functional units; it can be reloaded, enabling it to accelerate multiple program paths. We have developed a complete LLVM-based compiler prototype and simulation infrastructure and demonstrate that an 8-lane Chainsaw comes within 73% of the performance of an ideal dataflow architecture while reducing energy consumption by 45% relative to a 4-way out-of-order (OOO) processor.
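The chain abstraction is easiest to see as a small pass over a basic block's dependence graph. The sketch below is a minimal, hypothetical illustration of greedy chain formation; it is not the Chainsaw compiler's actual algorithm (which is LLVM-based), and the function name, the single-consumer heuristic, and the toy dependence graph are assumptions made purely for illustration.

    # Hypothetical sketch: greedy formation of "chains" (fused sequences of
    # dependent instructions) from a basic block's dataflow graph.
    def build_chains(instructions, consumers):
        """
        instructions: instruction ids in topological (program) order.
        consumers: dict mapping a producer instruction to the list of
                   instructions that consume its result.
        Returns a list of chains, each a list of instruction ids whose
        intermediate values could be forwarded through a lane's bypass
        register instead of the register file.
        """
        in_chain = set()
        chains = []
        for inst in instructions:
            if inst in in_chain:
                continue
            chain = [inst]
            in_chain.add(inst)
            cur = inst
            # Greedily extend the chain while the current instruction has a
            # single, not-yet-assigned consumer; this keeps the
            # producer-consumer value local to one execution unit.
            while True:
                uses = consumers.get(cur, [])
                free = [c for c in uses if c not in in_chain]
                if len(uses) == 1 and len(free) == 1:
                    cur = free[0]
                    chain.append(cur)
                    in_chain.add(cur)
                else:
                    break
            chains.append(chain)
        return chains

    # Toy example: r3 = r1 + r2; r4 = r3 * 2; r5 = r4 - r1
    # Each intermediate value has a single consumer, so all three
    # operations fuse into one chain.
    example_deps = {"add": ["mul"], "mul": ["sub"], "sub": []}
    print(build_chains(["add", "mul", "sub"], example_deps))
    # -> [['add', 'mul', 'sub']]

Chains formed this way map naturally onto a single lane: each intermediate value has exactly one consumer, so it can live in the lane's bypass register rather than being written back to the register file, which is the source of the back-end energy savings described above.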

[1] Yi Pan et al. PLUG: flexible lookup modules for rapid deployment of new protocols in high-speed routers, 2009, SIGCOMM '09.

[2] Karthikeyan Sankaralingam et al. A Graph-Based Program Representation for Analyzing Hardware Specialization Approaches, 2015, IEEE Computer Architecture Letters.

[3] Karthikeyan Sankaralingam et al. LEAP: Latency- energy- and area-optimized lookup pipeline, 2012, 2012 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS).

[4] Jung Ho Ahn et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[5] Scott A. Mahlke et al. Polymorphic Pipeline Array: A flexible multicore accelerator with virtualized execution for mobile multimedia applications, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6] James E. Smith et al. The microarchitecture of superscalar processors, 1995, Proc. IEEE.

[7] Scott A. Mahlke et al. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[8] Christoforos E. Kozyrakis et al. Understanding sources of inefficiency in general-purpose chips, 2010, ISCA.

[9] Scott A. Mahlke et al. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization, 2004, 37th International Symposium on Microarchitecture (MICRO-37).

[10] Richard M. Russell et al. The CRAY-1 computer system, 1978, CACM.

[11] Bill Dally. Power, Programmability, and Granularity: The Challenges of ExaScale Computing, 2011, IPDPS.

[12] Lei Zhang et al. A General-Purpose Many-Accelerator Architecture Based on Dataflow Graph Clustering of Applications, 2014, Journal of Computer Science and Technology.

[13] Sanjay J. Patel et al. Rigel: an architecture and scalable programming interface for a 1000-core accelerator, 2009, ISCA '09.

[14] Karthikeyan Sankaralingam et al. Performance evaluation of a DySER FPGA prototype system spanning the compiler, microarchitecture, and hardware implementation, 2015, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[15] Amin Ansari et al. Illusionist: Transforming lightweight cores into aggressive cores on demand, 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[16] Karthikeyan Sankaralingam et al. Dynamically Specialized Datapaths for energy efficient computing, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[17] Mikko H. Lipasti et al. An approach for implementing efficient superscalar CISC processors, 2006, The Twelfth International Symposium on High-Performance Computer Architecture.

[18] Saman P. Amarasinghe et al. Exploiting superword level parallelism with multimedia instruction sets, 2000, PLDI '00.

[19] Harish Patil et al. Pin: building customized program analysis tools with dynamic instrumentation, 2005, PLDI '05.

[20] Steven Swanson et al. QSCORES: Trading dark silicon for scalable energy efficiency with quasi-specific cores, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[21] Simha Sethumadhavan et al. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[22] Scott A. Mahlke et al. Composite Cores: Pushing Heterogeneity Into a Core, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[23] Steven Swanson et al. Conservation cores: reducing the energy of mature computations, 2010, ASPLOS XV.

[24] Karthikeyan Sankaralingam et al. Design, integration and implementation of the DySER hardware accelerator into OpenSPARC, 2012, IEEE International Symposium on High-Performance Computer Architecture.

[25] Scott A. Mahlke et al. Trace based phase prediction for tightly-coupled heterogeneous cores, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26] Karthikeyan Sankaralingam et al. Exploring the potential of heterogeneous Von Neumann/dataflow execution models, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[27] David A. Patterson et al. Computer Architecture: A Quantitative Approach, 1969.

[28] Apala Guha et al. Chainsaw: Von-Neumann accelerators to leverage fused instruction chains, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[29] D. R. Fulkerson. Note on Dilworth's decomposition theorem for partially ordered sets, 1956.

[30] Ali Saidi et al. The Reconfigurable Streaming Vector Processor (RSVP), 2003.

[31] Rudy Lauwereins et al. Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling, 2003, DATE.

[32] William J. Dally et al. A compile-time managed multi-level register file hierarchy, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[33] J. Sanchez et al. Flexible compiler-managed L0 buffers for clustered VLIW processors, 2003, Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36).

[34] Mark Horowitz et al. Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis, 2010, ISCA.

[35] Karthikeyan Sankaralingam et al. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing, 2012, IEEE Micro.

[36] Mikko H. Lipasti et al. Revolver: Processor architecture for power efficient loop execution, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[37] Gu-Yeon Wei et al. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures, 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[38] James R. Larus et al. Efficient path profiling, 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 29).

[39] William J. Dally et al. Smart Memories: a modular reconfigurable architecture, 2000, ISCA '00.

[40] Milo M. K. Martin et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, 2005, CARN.

[41] Ho-Seop Kim et al. An instruction set and microarchitecture for instruction level distributed processing, 2002, Proceedings of the 29th Annual International Symposium on Computer Architecture.

[42] Lieven Eeckhout et al. Automatic design of domain-specific instructions for low-power processors, 2015, 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[43] Michael Taylor. A landscape of the new dark silicon design regime, 2013.

[44] Engin Ipek et al. Core fusion: accommodating software diversity in chip multiprocessors, 2007, ISCA '07.

[45] Amin Ansari et al. Bundled execution of recurring traces for energy-efficient general purpose processing, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[46] James D. Warnock et al. Cell processor low-power design methodology, 2005, IEEE Micro.

[47] Karthikeyan Sankaralingam et al. Analyzing Behavior Specialized Acceleration, 2016, ASPLOS.

[48] David A. Patterson et al. The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V, 2016, ArXiv.

[49] David Black-Schaffer et al. Efficient Embedded Computing, 2008, Computer.

[50] Scott A. Mahlke et al. DynaMOS: Dynamic schedule migration for heterogeneous cores, 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[51] Ricardo E. Gonzalez et al. Xtensa: A Configurable and Extensible Processor, 2000, IEEE Micro.