TOD-Tree: Task-Overlapped Direct send Tree Image Compositing for Hybrid MPI Parallelism

Modern supercomputers have very powerful multi-core CPUs. The programming model on these supercomputers is switching from pure MPI to MPI for inter-node communication combined with shared memory and threads for intra-node communication. Consequently, the bottleneck in most systems is no longer computation but communication between nodes. In this paper, we present a new compositing algorithm for hybrid MPI parallelism that focuses on communication avoidance and on overlapping communication with computation, at the expense of evenly balancing the workload. The algorithm has three stages: a direct send stage in which nodes are arranged in groups and exchange regions of the image, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting, present strong-scaling results, and explain how we generally achieve better performance than these two algorithms.
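For illustration, the sketch below shows one way the direct send stage described above can overlap MPI communication with threaded compositing. It is not the authors' implementation: the image resolution, the group size G, the use of pre-multiplied RGBA floats, the requirement that the number of ranks is a multiple of G, and the assumption that pieces may be blended in arrival order are all simplifications made for this example, and the subsequent tree and gather stages are omitted.

```cpp
// Minimal sketch of a direct-send exchange with communication/computation
// overlap (hybrid MPI + OpenMP).  Assumptions (not from the paper): each rank
// holds a full locally rendered RGBA image, the number of ranks is a multiple
// of G, and pieces may be blended in the order they arrive.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

static const int W = 512, H = 512;   // image resolution (assumed)
static const int G = 4;              // direct-send group size (assumed)

// Blend 'src' over 'dst' (pre-multiplied alpha) using OpenMP threads.
static void composite_over(float* dst, const float* src, int npix) {
    #pragma omp parallel for
    for (int p = 0; p < npix; ++p) {
        const float a = src[4 * p + 3];
        for (int c = 0; c < 4; ++c)
            dst[4 * p + c] = src[4 * p + c] + (1.0f - a) * dst[4 * p + c];
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int npix   = W * H;
    const int region = npix / G;                 // pixels per image region
    std::vector<float> image(4 * npix, 0.0f);    // locally rendered RGBA (placeholder)

    const int group = rank / G;                  // this rank's direct-send group
    const int slot  = rank % G;                  // region this rank will own
    const int base  = group * G;                 // first rank of the group

    // Post all receives for our region and all sends of the other regions
    // up front, so the blending below can overlap with transfers in flight.
    std::vector<std::vector<float>> inbox(G - 1, std::vector<float>(4 * region));
    std::vector<MPI_Request> rreq(G - 1), sreq(G - 1);
    for (int i = 0, k = 0; i < G; ++i) {
        if (i == slot) continue;
        MPI_Irecv(inbox[k].data(), 4 * region, MPI_FLOAT, base + i, 0,
                  MPI_COMM_WORLD, &rreq[k]);
        MPI_Isend(image.data() + 4 * region * i, 4 * region, MPI_FLOAT,
                  base + i, 0, MPI_COMM_WORLD, &sreq[k]);
        ++k;
    }

    // Composite each piece as soon as it arrives; MPI_Waitany lets the
    // OpenMP blend of one piece overlap with the remaining communication.
    float* mine = image.data() + 4 * region * slot;
    for (int done = 0; done < G - 1; ++done) {
        int k;
        MPI_Waitany(G - 1, rreq.data(), &k, MPI_STATUS_IGNORE);
        composite_over(mine, inbox[k].data(), region);
    }
    MPI_Waitall(G - 1, sreq.data(), MPI_STATUSES_IGNORE);

    // The full algorithm would now run a tree compositing stage across the
    // groups and gather the finished regions onto a display rank (omitted).
    if (rank == 0) std::printf("direct send stage finished on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```

In a hybrid setting of this kind, one MPI rank per node with OpenMP threads doing the per-pixel blending keeps the number of messages small, which is in the spirit of the communication avoidance the abstract emphasizes.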

[1] Robert A. van de Geijn et al. Collective communication: theory, practice, and experience, 2007.

[2] Georg Hager et al. Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes, 2009, 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.

[3] Henry Fuchs et al. A sorting classification of parallel rendering, 1994, IEEE Computer Graphics and Applications.

[4] Xavier Cavin et al. Shift-Based Parallel Image Compositing on InfiniBand Fat-Trees, 2012, EGPGV@Eurographics.

[5] Kenneth D. Moreland et al. IceT users' guide and reference, 2009.

[6] Robert B. Ross et al. A configurable algorithm for parallel image-compositing applications, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[7] Kwan-Liu Ma et al. Massively parallel volume rendering using 2-3 swap image compositing, 2008, HiPC 2008.

[8] E. Wes Bethel et al. MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems, 2010, EGPGV@Eurographics.

[9] Ulrich Neumann. Communication costs for parallel volume-rendering algorithms, 1994, IEEE Computer Graphics and Applications.

[10] John Shalf et al. Exascale Computing Technology Challenges, 2010, VECPAR.

[11] E. Wes Bethel et al. Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems, 2012, IEEE Transactions on Visualization and Computer Graphics.

[12] Charles D. Hansen et al. A data distributed, parallel algorithm for ray-traced volume rendering, 1993.

[13] Juan Touriño et al. Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures, 2009, PVM/MPI.

[14] Jian Huang et al. An image compositing solution at scale, 2011, International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[15] Ray W. Grout et al. In Situ Visualization for Large-Scale Combustion Simulations, 2010.

[16] Kwan-Liu Ma et al. SLIC: scheduled linear image compositing for parallel volume rendering, 2003, IEEE Symposium on Parallel and Large-Data Visualization and Graphics (PVG 2003).

[17] Michael E. Papka et al. Performance Modeling of vl3 Volume Rendering on GPU-Based Clusters, 2014, EGPGV@EuroVis.

[18] Abhinav Vishnu et al. A performance comparison of current HPC systems: Blue Gene/Q, Cray XE6 and InfiniBand systems, 2014, Future Generation Computer Systems.

[19] William M. Hsu. Segmented ray casting for data parallel volume rendering, 1993.

[20] Renato Pajarola et al. Direct Send Compositing for Parallel Sort-last Rendering, 2007, Eurographics Symposium on Parallel Graphics and Visualization.

[21] Nelson L. Max et al. A contract based system for large data visualization, 2005, IEEE Visualization (VIS 05).

[22] L. Dagum et al. OpenMP: an industry standard API for shared-memory programming, 1998.