Multigrain parallel Delaunay mesh generation: challenges and opportunities for multithreaded architectures

Given the importance of parallel mesh generation in large-scale scientific applications and the proliferation of multilevel SMT-based architectures, it is imperative to gain insight into the interaction between meshing algorithms and these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level and fine-grain parallelism at the element level. This multigrain data-parallel approach targets clusters built from low-end, commercially available SMTs. Our experimental evaluation shows that current SMTs are not capable of executing fine-grain parallelism in PCDM. However, experiments on a simulated SMT indicate that with modest hardware support it is possible to exploit fine-grain parallelism opportunities. Exploiting fine-grain parallelism results in higher performance than a pure MPI implementation and closes the gap between the performance of PCDM and that of the state-of-the-art sequential mesher on a single physical processor. Our findings extend to other adaptive and irregular multigrain parallel algorithms.
