Performance Optimization and Auto-Tuning

In the broader computational research community, one subject of recent research is the problem of adapting algorithms to make effective use of multi- and many-core processors. Effective use of these architectures, which have complex memory hierarchies with many layers of cache, typically requires a careful examination of how an algorithm moves data through the memory hierarchy. Unfortunately, the relationship between algorithmic parameters such as blocking strategies, their impact on memory utilization, and the resulting runtime performance is often non-obvious. Auto-tuning is an empirical method used to discover optimal values for tunable algorithmic parameters under such circumstances. The challenge is compounded by the fact that the settings that produce the best performance for a given problem on a given platform may not be the best for a different problem on the same platform, or for the same problem on a different platform. The high performance visualization research community has begun to explore and adapt the principles of auto-tuning for the purpose of optimizing codes on modern multi- and many-core processors. This report focuses on how performance optimization studies reveal a dramatic variation in performance for two fundamental visualization algorithms: one based on a stencil operation, which has structured, uniform memory access, and the other ray casting volume rendering, which uses unstructured memory access patterns. The two case studies highlighted in this report show that the extra effort required to optimize such codes by adjusting the tunable algorithmic parameters can return substantial gains in performance. These case studies also explore the potential impact of, and the interaction between, algorithmic optimizations and tunable algorithmic parameters, along with the potential performance gains resulting from leveraging architecture-specific features.
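As a rough illustration of the empirical search that auto-tuning refers to, the sketch below times a blocked 5-point stencil for several candidate block sizes and keeps the fastest. It is not the report's method: the function names (`stencil_blocked`, `autotune`), the grid size, and the candidate list are assumptions chosen for illustration, and an exhaustive search over a single parameter stands in for whatever search strategy a real auto-tuner would use. Pure Python is used only to keep the example self-contained; it is unlikely to show the cache effects that motivate blocking in compiled code.

```python
import time

def stencil_blocked(grid, block):
    """5-point average stencil over the interior, traversed in block x block tiles."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary cells are copied unchanged
    for bi in range(1, n - 1, block):
        for bj in range(1, m - 1, block):
            for i in range(bi, min(bi + block, n - 1)):
                for j in range(bj, min(bj + block, m - 1)):
                    out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out

def autotune(kernel, grid, candidates, repeats=3):
    """Time every candidate value and return (fastest_value, timings)."""
    timings = {}
    for c in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(grid, c)
            best = min(best, time.perf_counter() - t0)  # keep the best of several runs
        timings[c] = best
    return min(timings, key=timings.get), timings

# Hypothetical problem instance: a 64 x 64 grid with deterministic contents.
grid = [[float((i * 31 + j * 17) % 7) for j in range(64)] for i in range(64)]
candidates = [4, 8, 16, 32, 64]
best_block, timings = autotune(stencil_blocked, grid, candidates)
print("fastest block size on this run:", best_block)
```

The winning block size can change between runs, problem sizes, and machines, which is the behavior the abstract describes: the tuned value is a property of the problem and platform pair, not of the algorithm alone.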
