Function-Level Processor (FLP): A Novel Processor Class for Efficient Processing of Streaming Applications

The exponential growth in computation demand drives chip vendors to heterogeneous architectures that combine Instruction-Level Processors (ILPs) and custom HW Accelerators (HWACCs) to provide the needed processing capabilities while meeting power/energy requirements. ILPs are highly flexible but power inefficient. Custom HWACCs are highly power efficient but inflexible, since each one targets a dedicated kernel. New processing architectures are needed that combine the power efficiency of HWACCs with enough flexibility to realize applications across targeted markets. This article introduces Function-Level Processors (FLPs) to fill the gap between ILPs and dedicated HWACCs. An FLP consists of configurable Function Blocks (FBs), each implementing a selected function, which are interconnected via programmable point-to-point connections to construct an extensible, configurable macro data-path. An FLP raises the programming abstraction to a Function-Set Architecture (FSA) that controls FB allocation, configuration, and scheduling. We demonstrate the benefits of FLPs with an industry example, the Pipeline-Vision Processor (PVP). We highlight the gained flexibility by mapping 10 embedded vision applications entirely to the FLP-PVP, which offers up to 22.4 GOP/s at an average power of 120 mW. The results also show that our FLP-PVP solution consumes 1/18th to 1/14th of the power of an ILP and 1/5th of the power of a hybrid ILP+HWACCs solution.
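As a rough illustration of the FB and macro data-path idea described above, the following Python sketch models function blocks that are allocated, configured, and chained through point-to-point links before a stream is pushed through them. It is a toy software model under stated assumptions, not the PVP's actual interface: the names `FunctionBlock`, `MacroDataPath`, `allocate`, `configure`, and `connect`, and the two example functions, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FunctionBlock:
    """Hypothetical model of a configurable Function Block (FB)."""
    name: str
    func: Callable[[int, dict], int]  # per-sample function, parameterized by config
    config: dict = field(default_factory=dict)

    def process(self, sample: int) -> int:
        return self.func(sample, self.config)


class MacroDataPath:
    """Hypothetical macro data-path: FBs joined by point-to-point links.

    Assumes the links form a simple acyclic chain (one successor per FB).
    """

    def __init__(self) -> None:
        self.blocks: Dict[str, FunctionBlock] = {}
        self.links: Dict[str, str] = {}  # source FB name -> destination FB name

    def allocate(self, block: FunctionBlock) -> None:
        self.blocks[block.name] = block

    def configure(self, name: str, **params) -> None:
        self.blocks[name].config.update(params)

    def connect(self, src: str, dst: str) -> None:
        self.links[src] = dst

    def run(self, entry: str, stream: List[int]) -> List[int]:
        out = []
        for sample in stream:
            name: Optional[str] = entry
            # Pass each sample from FB to FB along the configured links.
            while name is not None:
                sample = self.blocks[name].process(sample)
                name = self.links.get(name)
            out.append(sample)
        return out


# Two illustrative functions (assumed for this sketch, not taken from the article).
def scale(x: int, cfg: dict) -> int:
    return x * cfg.get("gain", 1)


def threshold(x: int, cfg: dict) -> int:
    return 1 if x >= cfg.get("level", 0) else 0


path = MacroDataPath()
path.allocate(FunctionBlock("scale", scale))
path.allocate(FunctionBlock("threshold", threshold))
path.configure("scale", gain=2)
path.configure("threshold", level=10)
path.connect("scale", "threshold")

result = path.run("scale", [3, 5, 8])
print(result)
```

In this sketch the function inside each block is fixed, while its parameters and its position in the chain are set at run time. That loosely mirrors the trade-off the abstract describes: more flexibility than a dedicated accelerator, at a coarser granularity than individual instructions.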
