Laika: Efficient In-Place Scheduling for 3D Mesh Graph Computations

Scientific computing problems are frequently solved using data-graph computations: algorithms that perform local updates on application-specific data associated with the vertices of a graph, over many time steps. The data graph in such computations is commonly a mesh graph, where vertices have positions in 3D space and edges connect physically nearby vertices. A scheduler controls the parallel execution of the algorithm. Two classes of parallel schedulers exist: double-buffering and in-place. Double-buffering schedulers have no read-write conflicts and therefore incur no synchronization overhead, but they require two copies of the vertex data and a higher iteration count because they converge more slowly. Computations for which this difference in convergence rate is significant (e.g., the multigrid method) are frequently performed using an in-place scheduler, which incurs synchronization overhead to avoid read-write conflicts on the single copy of vertex data. We present Laika, a deterministic in-place scheduler that we created using a principled three-step design strategy for high-performance schedulers. Laika reorders the input graph using a Hilbert space-filling curve to improve cache locality, and it minimizes parallel coordination overhead by explicitly curbing excess execution parallelism. Consequently, Laika has significantly lower scheduling overhead than alternative in-place schedulers and is even faster per iteration than the parallel double-buffered implementation on a reordered input graph. We derive an improved bound on the expected number of cache misses incurred during a traversal of a graph reordered using a space-filling curve. We also prove that on a mesh graph G = (V, E), Laika performs O(|V| + |E|) total work and achieves linear expected speedup with P = O(|V| / log^2 |V|) workers.
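The trade-off between the two scheduler classes can be illustrated with a small serial sketch. It is not Laika's implementation: it assumes a 1D chain of vertices with fixed boundary values, and the function names (`sweep_double_buffered`, `sweep_in_place`, `iterations_to_converge`) are hypothetical. The double-buffered sweep corresponds to a Jacobi iteration and the in-place sweep to a Gauss-Seidel iteration.

```python
def sweep_double_buffered(x, b):
    """Jacobi-style sweep: read only old values, write into a second buffer."""
    new = x[:]  # the second copy of the vertex data
    for i in range(1, len(x) - 1):
        new[i] = 0.5 * (x[i - 1] + x[i + 1] + b[i])
    return new


def sweep_in_place(x, b):
    """Gauss-Seidel-style sweep: overwrite x, so later updates see fresh values."""
    for i in range(1, len(x) - 1):
        x[i] = 0.5 * (x[i - 1] + x[i + 1] + b[i])
    return x


def iterations_to_converge(sweep, n=32, tol=1e-8, max_iters=100000):
    """Count sweeps until the largest per-vertex change drops below tol."""
    x = [0.0] * n
    x[-1] = 1.0          # fixed boundary value (assumed test problem)
    b = [0.0] * n        # zero right-hand side
    for it in range(1, max_iters + 1):
        old = x[:]
        x = sweep(x, b)
        if max(abs(a - c) for a, c in zip(old, x)) < tol:
            return it
    return max_iters


if __name__ == "__main__":
    print("double-buffered sweeps:", iterations_to_converge(sweep_double_buffered))
    print("in-place sweeps:       ", iterations_to_converge(sweep_in_place))
```

On this toy problem the in-place sweep needs fewer iterations, at the cost of a read-after-write dependence between neighboring updates. That dependence is what an in-place parallel scheduler has to respect.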
On 48 cores, Laika yields 38.4x parallel speedup and empirically fares well against comparably well-engineered alternatives: it runs 6.97 to 12.60 times faster in geometric mean over a suite of input graphs than other parallel schedulers and 222.57 times faster than the baseline serial implementation.
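The reordering step mentioned in the abstract can be sketched as follows. This is a simplified illustration, not the authors' code: it uses the classic 2D Hilbert index rather than a 3D one, assumes vertex positions are already quantized to integer grid coordinates, and the names `hilbert_index_2d` and `hilbert_order` are hypothetical.

```python
def hilbert_index_2d(order, x, y):
    """Position of grid cell (x, y) along a 2D Hilbert curve on a 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_order(points, order):
    """Return vertex indices sorted by Hilbert index of their grid coordinates."""
    return sorted(range(len(points)),
                  key=lambda i: hilbert_index_2d(order, points[i][0], points[i][1]))


if __name__ == "__main__":
    pts = [(x, y) for x in range(4) for y in range(4)]
    perm = hilbert_order(pts, 2)
    print([pts[i] for i in perm])  # vertices in curve order
```

Storing vertex data in the order returned by `hilbert_order` places spatially nearby vertices close together in memory, which is the locality property the reordering aims for.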
