Triggered-Issuance and Triggered-Execution: A Control Paradigm to Minimize Pipeline Stalls in Distributed Controlled Coarse-Grained Reconfigurable Arrays

Distributed controlled coarse-grained reconfigurable arrays (CGRAs) enable efficient execution of irregular control flows by reconciling divergence in the processing elements (PEs). To further improve performance by better exploiting spatial parallelism, the triggered instruction architecture (TIA) eliminates the program counter and branch instructions by converting control flows into predicate dependencies that act as triggers. However, pipeline stalls, which occur both within a PE (intra-PE) and between PEs (inter-PE), remain a major obstacle to overall performance. In fact, the stalls in distributed controlled CGRAs pose a unique problem that is difficult to resolve with previous techniques. This work presents a triggered-issuance and triggered-execution (TITE) paradigm in which the issuance and execution of instructions are triggered separately to further relax the predicate dependencies in TIA. In this paradigm, instructions are paired as dual instructions to eliminate stalls caused by control divergence. Tags that identify the data transmitted between PEs are forwarded for acceleration. As a result, both intra-PE and inter-PE pipeline stalls can be significantly reduced. Experiments show that TITE improves performance by 21 percent, energy efficiency by 17 percent, and area efficiency by 12 percent compared with a baseline TIA.
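
To make the trigger idea concrete, the following is a minimal software sketch of a single PE that has no program counter: each instruction carries a trigger (a required pattern of predicate values), and on every cycle the PE fires an instruction whose trigger matches the current predicate state. It models only the baseline triggered-instruction concept, not the TITE mechanisms (separate issuance and execution triggers, dual instructions, tag forwarding). All names here (`TriggeredInstruction`, `step`, `run`, `sum_down`, and the predicates `cmp_done`, `loop`, `halt`) are hypothetical and chosen for illustration; the first-match priority rule is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Regs = Dict[str, int]
Preds = Dict[str, bool]


@dataclass
class TriggeredInstruction:
    name: str
    trigger: Preds                      # predicate values required to fire
    action: Callable[[Regs], Preds]     # datapath op; returns predicate writes


def step(instrs: List[TriggeredInstruction], preds: Preds, regs: Regs) -> Optional[str]:
    """One cycle: fire the first instruction whose trigger matches (assumed priority)."""
    for ins in instrs:
        if all(preds[p] == v for p, v in ins.trigger.items()):
            preds.update(ins.action(regs))
            return ins.name
    return None  # no trigger matched: the PE idles this cycle


def run(instrs, preds, regs, max_cycles: int = 100) -> List[str]:
    """Step until nothing can fire; return the names of fired instructions."""
    trace = []
    for _ in range(max_cycles):
        fired = step(instrs, preds, regs)
        if fired is None:
            break
        trace.append(fired)
    return trace


# A loop "while n > 0: acc += n; n -= 1" expressed without branches.
def compare(regs: Regs) -> Preds:
    # The comparison result becomes a predicate instead of a branch target.
    return {"loop": regs["n"] > 0, "cmp_done": True}


def body(regs: Regs) -> Preds:
    regs["acc"] += regs["n"]
    regs["n"] -= 1
    return {"cmp_done": False}          # re-arm the comparison


def finish(regs: Regs) -> Preds:
    return {"halt": True, "cmp_done": False}


PROGRAM = [
    TriggeredInstruction("cmp", {"cmp_done": False, "halt": False}, compare),
    TriggeredInstruction("body", {"cmp_done": True, "loop": True}, body),
    TriggeredInstruction("exit", {"cmp_done": True, "loop": False}, finish),
]


def sum_down(n: int) -> Tuple[int, List[str]]:
    regs = {"n": n, "acc": 0}
    preds = {"cmp_done": False, "loop": False, "halt": False}
    trace = run(PROGRAM, preds, regs)
    return regs["acc"], trace
```

In this sketch, `body` cannot fire until `cmp` has written the `loop` predicate, so the two instructions strictly alternate. That serialization is a software analogue of the predicate dependency the abstract says TITE aims to relax.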
