A Reconfigurable SIMT Processor for Mobile Ray Tracing With Contention Reduction in Shared Memory

In this paper, we present a reconfigurable SIMT multi-core processor with a shared memory for mobile ray tracing. The proposed processor addresses two issues of the SIMT architecture: branch divergence among concurrently executed threads and contention in the shared memory. Performance degradation due to branch divergence is reduced by dividing a wide SIMT datapath into several narrow SIMT cores that execute independent threads asynchronously. The shared-memory contention caused by the multiple SIMT cores is alleviated by a new time-division multiplexing (TDM) scheme based on multi-phase clocks. Because the SIMT cores are synchronized to different clock phases, they send their requests to the shared memory sequentially rather than concurrently, which hides arbitration delays. The processor achieves the same datapath utilization as a 4-wide SIMT, which has been widely used by CPU-based ray tracers, while occupying only 68% of the area of the 4-wide SIMT. As a result, performance normalized to area is improved by 26% compared to previous work, with negligible overheads (2.6% in area and 1% in power consumption). The chip was fabricated in 90 nm CMOS technology and contains 2.3 M logic gates and 19.3 KB of SRAM. It consumes 221 mW at 100 MHz with Vdd = 1.2 V.
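The idea behind the TDM scheme can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit described in the paper: the core count of 4, the function names `tdm_schedule` and `count_conflicts`, and the request pattern are all hypothetical. The model assumes that each core is clocked on its own phase, so a request issued by core `c` in cycle `t` reaches the shared memory in time slot `t * num_cores + c`.

```python
from collections import Counter

NUM_CORES = 4  # assumption for illustration; the abstract does not give the core count


def tdm_schedule(requests, num_cores=NUM_CORES):
    """Map each request to a memory time slot under phase-shifted clocks.

    requests[c] is the list of core cycles in which core c issues a request.
    Core c owns phase c, so its request in cycle t lands in slot t*num_cores + c.
    Returns a sorted list of (slot, core) pairs.
    """
    slots = []
    for core, cycles in enumerate(requests):
        for cycle in cycles:
            slots.append((cycle * num_cores + core, core))
    return sorted(slots)


def count_conflicts(requests):
    """Count colliding requests when all cores share a single clock edge.

    Requests issued in the same cycle arrive together; all but one of them
    would have to wait for an arbiter.
    """
    per_cycle = Counter(cycle for cycles in requests for cycle in cycles)
    return sum(n - 1 for n in per_cycle.values() if n > 1)


# Hypothetical request pattern: cycles in which each of the 4 cores accesses memory.
requests = [[0, 1], [0], [0, 2], [1]]

schedule = tdm_schedule(requests)
slot_ids = [slot for slot, _ in schedule]
```

In this model, no two requests ever share a slot, because the slot index encodes the issuing core's phase. With a single shared clock edge, the same pattern produces collisions that an arbiter would have to resolve.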
