Invasive Tightly-Coupled Processor Arrays

We introduce a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs). The presented processor class is a highly parameterizable template which can be tailored before runtime to fulfill costumers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing from the areas of signal, image, and video processing as well as other streaming processing applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in way that they support self-adaptivity and resource awareness at hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing where an application can dynamically claim, execute, and release the resources. Furthermore, we show how invasive computing can be used as an enabler for power management. For the first time, we present a seamless mapping flow for TCPAs, based on a domain-specific language. Moreover, we outline a complete symbolic mapping approach. Finally, we support our claims by comparing a TCPA against an ARM Mali-T604 GPU in terms of performance and energy efficiency.

[1]  Jürgen Becker,et al.  Multiprocessor System-on-Chip - Hardware Design and Tool Integration , 2011, Multiprocessor System-on-Chip.

[2]  Jürgen Teich,et al.  A deeply pipelined and parallel architecture for denoising medical images , 2010, 2010 International Conference on Field-Programmable Technology.

[3]  Olivier Temam,et al.  A Practical Approach for Reconciling High and Predictable Performance in Non-Regular Parallel Programs , 2008, 2008 Design, Automation and Test in Europe.

[4]  Gerald H. Hilderink,et al.  Parallel Processing — the picoChip way! , 2003 .

[5]  L. Gwennap ADAPTEVA : MORE FLOPS , LESS WATTS Epiphany Offers Floating-Point Accelerator for Mobile Processors , 2011 .

[6]  Gilles Kahn,et al.  The Semantics of a Simple Language for Parallel Programming , 1974, IFIP Congress.

[7]  Scott A. Mahlke,et al.  Trimaran: An Infrastructure for Research in Instruction-Level Parallelism , 2004, LCPC.

[8]  B. Ramakrishna Rau,et al.  Iterative modulo scheduling: an algorithm for software pipelining loops , 1994, MICRO 27.

[9]  Simha Sethumadhavan,et al.  Distributed Microarchitectural Protocols in the TRIPS Prototype Processor , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[10]  Jürgen Teich,et al.  PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications , 2008, ARC.

[11]  Jingling Xue,et al.  Unimodular Transformations of Non-Perfectly Nested Loops , 1997, Parallel Comput..

[12]  Paul Feautrier,et al.  Polyhedron Model , 2011, Encyclopedia of Parallel Computing.

[13]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[14]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[15]  Narayanan Vijaykrishnan,et al.  Exploiting Heterogeneity for Energy Efficiency in Chip Multiprocessors , 2011, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[16]  Fadi J. Kurdahi,et al.  Automatic compilation to a coarse-grained reconfigurable system-opn-chip , 2003, TECS.

[17]  Markus Weinhardt,et al.  PACT XPP—A Self-Reconfigurable Data Processing Architecture , 2003, The Journal of Supercomputing.

[18]  Jürgen Teich,et al.  Power-Efficient Reconfiguration Control in Coarse-Grained Dynamically Reconfigurable Architectures , 2009, J. Low Power Electron..

[19]  Fadi J. Kurdahi,et al.  MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications , 2000, IEEE Trans. Computers.

[20]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[21]  Martin Fowler,et al.  Domain-Specific Languages , 2010, The Addison-Wesley signature series.

[22]  Timothy Mattson,et al.  A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[23]  Jürgen Teich,et al.  Invasive Computing: An Overview , 2011, Multiprocessor System-on-Chip.

[24]  Jürgen Teich,et al.  Symbolic parallelization of loop programs for massively parallel processor arrays , 2013, 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors.

[25]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[26]  Jörg Henkel,et al.  Invasive manycore architectures , 2012, 17th Asia and South Pacific Design Automation Conference.

[27]  Kiyoung Choi,et al.  An algorithm for mapping loops onto coarse-grained reconfigurable architectures , 2003, LCTES '03.

[28]  Balaram Sinharoy,et al.  POWER7: IBM's next generation server processor , 2010, 2009 IEEE Hot Chips 21 Symposium (HCS).

[29]  Vikram Bhatt,et al.  The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future , 2011, IEEE Micro.

[30]  Jürgen Teich,et al.  MAML: An ADL for Designing Single and Multiprocessor Architectures , 2008 .

[31]  Jürgen Teich,et al.  A highly parameterizable parallel processor array architecture , 2006, 2006 IEEE International Conference on Field Programmable Technology.

[32]  Jürgen Teich,et al.  Hierarchical Partitioning for Piecewise Linear Algorithms , 2006, International Symposium on Parallel Computing in Electrical Engineering (PARELEC'06).

[33]  Olivier Temam,et al.  CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[34]  Vahid Lari,et al.  Invasive tightly coupled processor arrays = Invasive eng gekoppelte Prozessorfelder , 2016 .

[35]  Christian Lengauer,et al.  Towards systolizing compilation , 1991, Distributed Computing.

[36]  Jürgen Teich,et al.  Towards Symbolic Run-Time Reconfiguration in Tightly-Coupled Processor Arrays , 2011, 2011 International Conference on Reconfigurable Computing and FPGAs.

[37]  Jürgen Teich,et al.  Resource constrained and speculative scheduling of an algorithm class with run-time dependent conditionals , 2004, Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004..

[38]  Jingling Xue,et al.  Loop Tiling for Parallelism , 2000, Kluwer International Series in Engineering and Computer Science.

[39]  Bjorn De Sutter,et al.  Architecture Enhancements for the ADRES Coarse-Grained Reconfigurable Array , 2008, HiPEAC.

[40]  Nikil Dutt,et al.  Processor Description Languages , 2008 .

[41]  Jürgen Teich,et al.  Decentralized dynamic resource management support for massively parallel processor arrays , 2011, ASAP 2011 - 22nd IEEE International Conference on Application-specific Systems, Architectures and Processors.

[42]  Mike Butts,et al.  Synchronization through Communication in a Massively Parallel Processor Array , 2007, IEEE Micro.

[43]  BagherzadehNader,et al.  Automatic compilation to a coarse-grained reconfigurable system-opn-chip , 2003 .

[44]  Jürgen Teich,et al.  Invasive Algorithms and Architectures Invasive Algorithmen und Architekturen , 2008, it Inf. Technol..

[45]  Jürgen Teich,et al.  Hierarchical power management for adaptive tightly-coupled processor arrays , 2013, TODE.