Supporting Static Binding in Stream Rewriting for Heterogeneous Many-Core Architectures

Heterogeneous multi- and many-core systems offer benefits such as reduced energy consumption and improved throughput for both high-performance and low-power applications. However, beyond the design of the hardware architecture itself, programming many-core systems raises several challenges. To address them, we extend the existing concept of stream rewriting into a model of computation for specifying highly concurrent applications on heterogeneous systems. In particular, our approach permits partitioning the stream and binding different sections to specialized hardware components. Because stream rewriting manages a large number of active tasks without a central scheduler, the distribution and synchronization of work items can be performed asynchronously to improve hardware utilization. Several case studies using an FPGA prototype demonstrate the scalability of our approach.
