Fast Detection of Overlapping Communities via Online Tensor Methods on GPUs

We present a fast tensor-based approach for detecting hidden overlapping communities under the Mixed Membership Stochastic Blockmodel (MMSB). We describe two implementations: a GPU-based implementation that exploits the parallelism of SIMD architectures, and a CPU-based implementation for larger datasets for which GPU memory does not suffice. The GPU-based implementation relies on careful optimization of storage, data transfer and matrix computations. The CPU-based implementation relies on sparse linear algebraic operations that exploit the sparsity of the data. We use stochastic gradient descent for multilinear spectral optimization, which allows a flexible tradeoff between node sub-sampling and the accuracy of the results. We validate our results on datasets from Facebook, Yelp and DBLP, where ground truth is available, using notions of $p$-values and false discovery rates, and obtain high accuracy for membership recovery. We compare our results, in terms of both execution time and accuracy, with state-of-the-art algorithms such as the variational method, and report gains of many orders of magnitude in execution time. The tensor method is also applicable to unsupervised learning of a wide range of latent variable models, and we demonstrate efficient recovery of topics from the NYTimes dataset.
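To make the notion of mixed membership concrete, the following is a minimal sketch of sampling a small graph from a simplified MMSB. It is an illustration only, not the implementation described above. It assumes that each node draws a membership vector from a Dirichlet distribution and that nodes $i$ and $j$ are connected with probability $\pi_i^\top P \pi_j$, the marginalized form of the model. The function name `sample_mmsb` and the parameters `alpha` and `P` are hypothetical names chosen for this example.

```python
import random


def sample_mmsb(n, alpha, P, seed=0):
    """Sample memberships and an undirected graph from a simplified MMSB.

    n     -- number of nodes
    alpha -- Dirichlet concentration parameters, one per community
    P     -- k x k community connectivity matrix with entries in [0, 1]
    """
    rng = random.Random(seed)
    k = len(alpha)

    # Draw each membership vector from Dirichlet(alpha) by normalizing
    # independent Gamma(alpha_c, 1) variates.
    pi = []
    for _ in range(n):
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        pi.append([x / total for x in g])

    # Connect i and j with probability pi_i^T P pi_j (assumed marginal form).
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = sum(pi[i][a] * P[a][b] * pi[j][b]
                    for a in range(k) for b in range(k))
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return pi, adj


# Example: 30 nodes, 3 communities, dense within and sparse across communities.
P = [[0.8, 0.05, 0.05],
     [0.05, 0.8, 0.05],
     [0.05, 0.05, 0.8]]
pi, adj = sample_mmsb(30, [0.3, 0.3, 0.3], P)
```

Small values in `alpha` push each membership vector toward a single community, while larger values produce nodes that are spread over several communities, which is the overlapping regime the method targets.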
