Exploring the potential of heterogeneous Von Neumann/dataflow execution models

General purpose processors (GPPs), from small in-order designs to many-issue out-of-order designs, incur large power overheads which must be addressed for future technology generations. Major sources of overhead include structures that dynamically extract the data-dependence graph or maintain precise state. For irregular workloads, current specialization approaches either heavily curtail performance or provide too little benefit. Interestingly, well-known explicit-dataflow architectures eliminate these overheads by directly executing the data-dependence graph and eschewing instruction-precise recoverability. However, even after decades of research, dataflow architectures have yet to come into prominence as a solution. We attribute this to a lack of effective control speculation and the latency overhead of explicit communication, which is crippling for certain codes. This paper makes the observation that if both out-of-order and explicit-dataflow execution were available in one processor, many types of GPP cores could benefit from dynamically switching between them during certain phases of an application's lifetime. Analysis reveals that an ideal explicit-dataflow engine could be profitable for more than half of instructions, providing significant performance and energy improvements. The challenge is to achieve these benefits without introducing excess hardware complexity. To this end, we propose the Specialization Engine for Explicit-Dataflow (SEED). Integrated with an in-order core, SEED provides 1.67× performance and 1.65× energy benefits; with a dual-issue out-of-order (OOO) core, 1.33× and 1.70×; and with a quad-issue OOO core, 1.14× and 1.54×.
