Paper information - Laika: Efficient In-Place Scheduling for 3D Mesh Graph Computations

Scientific computing problems are frequently solved using data-graph computations -- algorithms that perform local updates on application-specific data associated with vertices of a graph, over many time steps. The data-graph in such computations is commonly a mesh graph, where vertices have positions in 3D space, and edges connect physically nearby vertices. A scheduler controls the parallel execution of the algorithm. Two classes of parallel schedulers exist: double-buffering and in-place. Double-buffering schedulers do not incur synchronization overheads due to an absence of read-write conflicts, but require two copies of the vertices, as well as a higher iteration count due to a slower convergence rate. Computations for which this difference in convergence rate is significant (e.g., multigrid method) are frequently performed using an in-place scheduler, which incurs synchronization overheads to avoid read-write conflicts on the single copy of vertex data. We present Laika, a deterministic in-place scheduler we created using a principled three-step design strategy for high-performance schedulers. Laika reorders the input graph using a Hilbert space-filling curve to improve cache locality and minimizes parallel coordination overhead by explicitly curbing excess execution parallelism. Consequently, Laika has significantly lower scheduling overhead than alternative in-place schedulers and is even faster per iteration than the parallel double-buffered implementation on a reordered input graph. We derive an improved bound on the expected number of cache misses incurred during a traversal of a graph reordered using a space-filling curve. We also prove that on a mesh graph G = (V, E), Laika performs O(|V| + |E|) total work and achieves linear expected speedup with P = O(|V| / log^2 |V|) workers. 
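To make the reordering step concrete, here is a minimal sketch of sorting mesh vertices along a 3D Hilbert curve. It is an illustration only, not Laika's implementation: the function names `hilbert_index` and `hilbert_order`, the `bits` grid resolution, and the quantization of positions onto an integer grid are all assumptions made for this example. The index computation follows Skilling's well-known transpose-based construction.

```python
def hilbert_index(cell, bits):
    """Map integer grid coordinates (each in [0, 2**bits)) to a Hilbert-curve index.

    Illustrative helper (hypothetical name), based on Skilling's transpose method.
    """
    x = list(cell)
    n = len(x)
    top = 1 << (bits - 1)

    # Undo the per-level rotations and reflections, from the top bit downward.
    q = top
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p                      # invert the low bits of x[0]
            else:
                t = (x[0] ^ x[i]) & p          # exchange low bits of x[0] and x[i]
                x[0] ^= t
                x[i] ^= t
        q >>= 1

    # Gray-encode across the axes.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = top
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    for i in range(n):
        x[i] ^= t

    # Interleave the bits of the transposed coordinates, most significant first.
    index = 0
    for b in range(bits - 1, -1, -1):
        for i in range(n):
            index = (index << 1) | ((x[i] >> b) & 1)
    return index


def hilbert_order(positions, bits=10):
    """Return vertex ids sorted by the Hilbert index of their quantized 3D position."""
    dims = range(3)
    lo = [min(p[d] for p in positions) for d in dims]
    hi = [max(p[d] for p in positions) for d in dims]
    side = (1 << bits) - 1

    def cell(p):
        # Scale each coordinate onto the integer grid [0, side]; flat axes map to 0.
        return tuple(
            min(side, int((p[d] - lo[d]) * side / (hi[d] - lo[d]))) if hi[d] > lo[d] else 0
            for d in dims
        )

    return sorted(range(len(positions)), key=lambda v: hilbert_index(cell(positions[v]), bits))


if __name__ == "__main__":
    pts = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
    order = hilbert_order(pts, bits=2)
    print([pts[v] for v in order[:8]])
```

On a full grid, consecutive vertices in the returned order are spatial neighbors, which is the property that lets a traversal in this order reuse recently loaded cache lines. Relabeling the vertex array by this permutation is one plausible way to apply it to a mesh graph.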
On 48 cores, Laika yields 38.4x parallel speedup and empirically fares well against comparably well-engineered alternatives: it runs 6.97--12.60 times faster in geometric mean over a suite of input graphs than other parallel schedulers and 222.57 times faster than the baseline serial implementation.
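The trade-off between the two scheduler classes described in the abstract can be seen in a small serial sketch. Double buffering corresponds to a Jacobi-style sweep that reads only the previous copy of the vertex data, while an in-place sweep (Gauss-Seidel style) reads values already updated in the same sweep. The chain graph, the neighbor-averaging update, and all function names below are assumptions chosen for illustration; the code is serial and does not model Laika or any parallel scheduler.

```python
def make_chain(n):
    """Adjacency lists for a path graph 0 - 1 - ... - (n-1) (illustrative input)."""
    return {v: [u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)}


def sweep_double_buffered(graph, values, fixed):
    """One sweep that reads only the old copy and writes into a second copy."""
    new = dict(values)  # the second copy of the vertex data
    for v, nbrs in graph.items():
        if v not in fixed:
            new[v] = sum(values[u] for u in nbrs) / len(nbrs)
    return new


def sweep_in_place(graph, values, fixed):
    """One sweep on a single copy: later updates see earlier updates of this sweep."""
    for v, nbrs in graph.items():
        if v not in fixed:
            values[v] = sum(values[u] for u in nbrs) / len(nbrs)
    return values


def sweeps_to_converge(sweep, graph, values, fixed, tol=1e-8, max_sweeps=100_000):
    """Repeat `sweep` until the largest per-vertex change drops below `tol`."""
    values = dict(values)
    for k in range(1, max_sweeps + 1):
        old = dict(values)
        values = sweep(graph, values, fixed)
        if max(abs(values[v] - old[v]) for v in graph) < tol:
            return k, values
    raise RuntimeError("did not converge")


if __name__ == "__main__":
    n = 20
    graph = make_chain(n)
    fixed = {0, n - 1}                      # boundary vertices keep their values
    start = {v: 0.0 for v in graph}
    start[n - 1] = 1.0

    jacobi_sweeps, _ = sweeps_to_converge(sweep_double_buffered, graph, start, fixed)
    inplace_sweeps, _ = sweeps_to_converge(sweep_in_place, graph, start, fixed)
    print("double-buffered sweeps:", jacobi_sweeps)
    print("in-place sweeps:", inplace_sweeps)
```

On this toy problem the in-place variant needs fewer sweeps to reach the same tolerance, which is the convergence advantage the abstract refers to. The cost is that a parallel in-place execution must order or synchronize conflicting updates, whereas the double-buffered sweep has no read-write conflicts within a sweep.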
