GPU Acceleration of Graph Matching, Clustering, and Partitioning

We consider sequential algorithms for hypergraph partitioning and GPU (i.e., fine-grained shared-memory parallel) algorithms for graph partitioning and clustering. Our investigation into sequential hypergraph partitioning concerns the efficient construction of high-quality matchings for hypergraph coarsening, and optimisation with respect to general hypergraph partitioning quality metrics. We introduce the λ(λ−1)-metric, which exactly measures the communication volume of a finite element computation, and show how an ordinary hypergraph bipartitioner can be used to greedily optimise a partitioning with respect to a general quality metric. Graph partitioning and clustering on the GPU is achieved by implementing all parts of the multi-level paradigm (i.e., matching, coarsening, and refinement) on the GPU. We first develop GPU algorithms for matching and coarsening. These are then used as building blocks for a greedy agglomerative modularity clustering heuristic, with which we participated in the 10th DIMACS partitioning and clustering challenge. By combining the parallel matching and coarsening algorithms with a parallel partitioning refinement method, and implementing these algorithms using general sparse matrix-vector multiplication operations, we are able to perform graph partitioning entirely on the GPU. The GPU partitioning algorithm is compared in terms of both quality and speed to the sequential METIS graph partitioner: it is faster for graphs with a million or more vertices, while offering similar quality. The highest speedup achieved over METIS is a factor of 6.2, obtained when a graph with 24 million vertices and 29 million edges is partitioned into two parts in 3.7 seconds on the GPU (an NVIDIA Tesla C2075) with an edge cut of 329. This shows that the GPU can effectively be used for the multi-level analysis of large graphs.
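
To make the matching and coarsening steps of the multi-level paradigm concrete, the following is a minimal sequential Python sketch. It is not the GPU algorithm described above: it assumes a simple heavy-edge greedy matching on a weighted edge list, and the function names (`greedy_matching`, `coarsen`, `edge_cut`) and the small path graph are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def greedy_matching(n, edges):
    """Greedy matching: visit edges by decreasing weight and match free endpoints.

    edges is a list of (u, v, weight) tuples on vertices 0..n-1.
    Returns a list where match[v] is v's partner, or v itself if unmatched.
    """
    match = list(range(n))
    matched = [False] * n
    for u, v, _w in sorted(edges, key=lambda e: -e[2]):
        if u != v and not matched[u] and not matched[v]:
            matched[u] = matched[v] = True
            match[u], match[v] = v, u
    return match


def coarsen(n, edges, match):
    """Contract every matched pair into a single coarse vertex.

    Returns (number of coarse vertices, coarse edge list, fine-to-coarse map).
    Parallel edges between two coarse vertices are merged by summing weights.
    """
    coarse_id = [-1] * n
    k = 0
    for v in range(n):
        if coarse_id[v] == -1:
            coarse_id[v] = coarse_id[match[v]] = k
            k += 1
    weights = defaultdict(int)
    for u, v, w in edges:
        cu, cv = coarse_id[u], coarse_id[v]
        if cu != cv:  # edges inside a contracted pair disappear
            weights[(min(cu, cv), max(cu, cv))] += w
    coarse_edges = [(a, b, w) for (a, b), w in sorted(weights.items())]
    return k, coarse_edges, coarse_id


def edge_cut(edges, part):
    """Total weight of the edges whose endpoints lie in different parts."""
    return sum(w for u, v, w in edges if part[u] != part[v])


# Hypothetical example: the path 0 - 1 - 2 - 3 with edge weights 3, 1, 2.
example_edges = [(0, 1, 3), (1, 2, 1), (2, 3, 2)]
example_match = greedy_matching(4, example_edges)
num_coarse, coarse_edges, coarse_id = coarsen(4, example_edges, example_match)
print(example_match)             # partner of each vertex
print(num_coarse, coarse_edges)  # coarse graph after one contraction
print(edge_cut(example_edges, coarse_id))  # cut if each coarse vertex is a part
```

In this sketch the heaviest edges are contracted first, so the weight remaining between coarse vertices is small, which is the usual motivation for heavy-edge matching before refinement. A fine-grained parallel version has to resolve conflicting match proposals between threads, which this sequential loop avoids by construction.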
