Effective static bin patterns for sort-middle rendering

To utilize an ever-increasing number of processors effectively during parallel rendering, hardware and software designers rely on sophisticated load-balancing strategies. Dynamic load balancing is a powerful solution, but it requires complex work distribution and synchronization mechanisms. Graphics hardware manufacturers have instead opted for static load balancing: triangle data is distributed to processors based on its overlap with screen-space tiles arranged in a fixed pattern. The current strategy of using simple patterns for a small number of fast rasterizers achieves formidable performance, but it is unclear how well this approach will scale as the number of processors increases further. To address this issue, we analyze real-world rendering workloads, derive requirements for effective patterns, and present ten pattern design strategies based on these requirements. In addition to a theoretical evaluation of these design strategies, we compare the performance of selected patterns in a parallel sort-middle software rendering pipeline on an extensive set of triangle data captured from eight recent video games. As a result, we identify a set of patterns that scale well and perform significantly better than naïve approaches.
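The static binning idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The tile size, the two example patterns (`column_stripes` and `diagonal`), and the function name `bins_for_triangle` are assumptions made for illustration, and the overlap test uses the triangle's bounding box, which is conservative and may select tiles the triangle does not actually touch.

```python
import math
from typing import Callable, Set, Tuple

TILE_SIZE = 32  # assumed tile edge length in pixels (hypothetical value)

Point = Tuple[float, float]
Pattern = Callable[[int, int, int], int]


def column_stripes(tx: int, ty: int, num_procs: int) -> int:
    """Hypothetical simple pattern: each tile column belongs to one processor."""
    return tx % num_procs


def diagonal(tx: int, ty: int, num_procs: int) -> int:
    """Hypothetical pattern: processor ids shift by one tile on every row."""
    return (tx + ty) % num_procs


def bins_for_triangle(tri: Tuple[Point, Point, Point],
                      pattern: Pattern,
                      num_procs: int) -> Set[int]:
    """Return the processors whose tiles overlap the triangle's bounding box.

    The pattern is fixed ahead of time (static), so no runtime work
    redistribution or synchronization between processors is needed.
    """
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    # Convert the pixel-space bounding box to inclusive tile coordinates.
    tx0 = max(0, math.floor(min(xs) / TILE_SIZE))
    tx1 = max(0, math.floor(max(xs) / TILE_SIZE))
    ty0 = max(0, math.floor(min(ys) / TILE_SIZE))
    ty1 = max(0, math.floor(max(ys) / TILE_SIZE))
    procs: Set[int] = set()
    for ty in range(ty0, ty1 + 1):
        for tx in range(tx0, tx1 + 1):
            procs.add(pattern(tx, ty, num_procs))
    return procs


if __name__ == "__main__":
    tri = ((0.0, 0.0), (40.0, 0.0), (0.0, 40.0))
    print(sorted(bins_for_triangle(tri, column_stripes, 4)))
    print(sorted(bins_for_triangle(tri, diagonal, 4)))
```

In this sketch, the same triangle is sent to a different set of processors depending on the pattern, which is the degree of freedom that pattern design strategies exploit.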
