Bipartite-Oriented Distributed Graph Partitioning for Big Learning

Many machine learning and data mining (MLDM) problems like recommendation, topic modeling, and medical diagnosis can be modeled as computing on bipartite graphs. However, most distributed graph-parallel systems are oblivious to the unique characteristics in such graphs and existing online graph partitioning algorithms usually cause excessive replication of vertices as well as significant pressure on network communication. This article identifies the challenges and opportunities of partitioning bipartite graphs for distributed MLDM processing and proposes BiGraph, a set of bipartite-oriented graph partitioning algorithms. BiGraph leverages observations such as the skewed distribution of vertices, discriminated computation load and imbalanced data sizes between the two subsets of vertices to derive a set of optimal graph partitioning algorithms that result in minimal vertex replication and network communication. BiGraph has been implemented on PowerGraph and is shown to have a performance boost up to 17.75X (from 1.16X) for four typical MLDM algorithms, due to reducing up to 80% vertex replication, and up to 96% network traffic.

[1]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[2]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[3]  Binyu Zang,et al.  Bipartite-oriented distributed graph partitioning for big learning , 2014, APSys.

[4]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[5]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[6]  Tie-Yan Liu,et al.  Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering , 2005, KDD '05.

[7]  Eric P. Xing,et al.  Fugue: Slow-Worker-Agnostic Distributed Learning for Big Models on Big Data , 2014, AISTATS.

[8]  Theodore L. Willke,et al.  GraphBuilder: scalable graph ETL framework , 2013, GRADES.

[9]  Chris H. Q. Ding,et al.  Bipartite graph partitioning and data clustering , 2001, CIKM '01.

[10]  George Karypis,et al.  Multilevel algorithms for partitioning power-law graphs , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[11]  Yehuda Koren,et al.  Matrix Factorization Techniques for Recommender Systems , 2009, Computer.

[12]  Binyu Zang,et al.  Computation and communication efficient graph processing with distributed immutable view , 2014, HPDC '14.

[13]  Tao Qin,et al.  Hierarchical taxonomy preparation for text categorization using consistent bipartite spectral graph copartitioning , 2005, IEEE Transactions on Knowledge and Data Engineering.

[14]  Charalampos E. Tsourakakis,et al.  FENNEL: streaming graph partitioning for massive scale graphs , 2014, WSDM.

[15]  Vipin Kumar,et al.  Parallel Multilevel Algorithms for Multi-constraint Graph Partitioning (Distinguished Paper) , 2000, Euro-Par.

[16]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[17]  Jure Leskovec,et al.  Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters , 2008, Internet Math..

[18]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[19]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[20]  Robert Elsässer,et al.  New spectral bounds on k-partitioning of graphs , 2001, SPAA '01.

[21]  Gabriel Kliot,et al.  Streaming graph partitioning for large distributed graphs , 2012, KDD.