Clustering Large Probabilistic Graphs

We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.

[1]  Jianzhong Li,et al.  Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics , 2010, KDD.

[2]  George Kollios,et al.  k-nearest neighbors in uncertain graphs , 2010, Proc. VLDB Endow..

[3]  Jian Pei,et al.  Probabilistic path queries in road networks: traffic uncertainty aware path selection , 2010, EDBT '10.

[4]  Leslie G. Valiant,et al.  The Complexity of Enumeration and Reliability Problems , 1979, SIAM J. Comput..

[5]  Sean R. Collins,et al.  Global landscape of protein complexes in the yeast Saccharomyces cerevisiae , 2006, Nature.

[6]  Alan M. Frieze,et al.  Random graphs , 2006, SODA '06.

[7]  Aristides Gionis,et al.  Clustering aggregation , 2005, 21st International Conference on Data Engineering (ICDE'05).

[8]  Christopher Ré,et al.  Probabilistic databases: diamonds in the dirt , 2009, CACM.

[9]  Béla Bollobás,et al.  Random Graphs: Notation , 2001 .

[10]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[11]  Vipin Kumar,et al.  Parallel Multilevel series k-Way Partitioning Scheme for Irregular Graphs , 1999, SIAM Rev..

[12]  Nir Ailon,et al.  Aggregating inconsistent information: Ranking and clustering , 2008 .

[13]  Hannu Toivonen,et al.  Finding reliable subgraphs from large probabilistic graphs , 2008, Data Mining and Knowledge Discovery.

[14]  Sudipto Guha,et al.  Exceeding expectations and clustering uncertain data , 2009, PODS.

[15]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[16]  Dmitrij Frishman,et al.  MIPS: analysis and annotation of proteins from whole genomes in 2005 , 2005, Nucleic Acids Res..

[17]  Satu Elisa Schaeffer,et al.  Graph Clustering , 2017, Encyclopedia of Machine Learning and Data Mining.

[18]  Feifei Li,et al.  Finding frequent items in probabilistic data , 2008, SIGMOD Conference.

[19]  Thomas Seidl,et al.  Subspace Clustering for Uncertain Data , 2010, SDM.

[20]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data and Expected Ranks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[21]  Xiang Lian,et al.  Efficient query answering in probabilistic RDF graphs , 2011, SIGMOD '11.

[22]  Russ Bubley,et al.  Randomized algorithms , 1995, CSUR.

[23]  Graham Cormode,et al.  Approximation algorithms for clustering uncertain data , 2008, PODS.

[24]  Jian Li,et al.  A unified approach to ranking in probabilistic databases , 2009, The VLDB Journal.

[25]  Jian Pei,et al.  Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[26]  Christoph Koch,et al.  Approximating predicates and expressive queries on probabilistic databases , 2008, PODS.

[27]  H. Frank,et al.  Shortest Paths in Probabilistic Graphs , 1969, Oper. Res..

[28]  Lei Chen,et al.  Efficiently Answering Probability Threshold-Based Shortest Path Queries over Uncertain Graphs , 2010, DASFAA.

[29]  Ron Shamir,et al.  Clustering Gene Expression Patterns , 1999, J. Comput. Biol..

[30]  Hans-Peter Kriegel,et al.  Density-based clustering of uncertain data , 2005, KDD '05.

[31]  Charu C. Aggarwal,et al.  Graph Clustering , 2010, Encyclopedia of Machine Learning and Data Mining.

[32]  Reynold Cheng,et al.  Mining uncertain data with probabilistic guarantees , 2010, KDD.

[33]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[34]  Silvio Lattanzi,et al.  Affiliation networks , 2009, STOC '09.

[35]  Reynold Cheng,et al.  Efficient Clustering of Uncertain Data , 2006, Sixth International Conference on Data Mining (ICDM'06).

[36]  Philip S. Yu,et al.  A Framework for Clustering Uncertain Data Streams , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[37]  Hans-Peter Kriegel,et al.  Probabilistic frequent itemset mining in uncertain databases , 2009, KDD.

[38]  Dimitrios Skoutas,et al.  Efficient discovery of frequent subgraph patterns in uncertain graph databases , 2011, EDBT/ICDT '11.

[39]  Yuval Shavitt,et al.  Approximating the statistics of various properties in randomly weighted graphs , 2009, SODA '11.

[40]  Feifei Li,et al.  Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations , 2008, IEEE Transactions on Knowledge and Data Engineering.

[41]  Jianzhong Li,et al.  Frequent subgraph pattern mining on uncertain graph data , 2009, CIKM.

[42]  Béla Bollobás,et al.  Random Graphs , 1985 .

[43]  K. P. Kaliyamurthie,et al.  Clustering Uncertain Data Using Voronoi Diagrams and R-Tree Index , 2014 .

[44]  Roded Sharan,et al.  Cluster graph modification problems , 2002, Discret. Appl. Math..

[45]  Albert,et al.  Emergence of scaling in random networks , 1999, Science.

[46]  Ulrik Brandes,et al.  Engineering graph clustering: Models and experimental evaluation , 2008, JEAL.

[47]  Anthony Wirth,et al.  Correlation Clustering , 2010, Encyclopedia of Machine Learning and Data Mining.

[48]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[49]  Yufei Tao,et al.  Efficient Evaluation of Probabilistic Advanced Spatial Queries on Existentially Uncertain Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[50]  M E J Newman,et al.  Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.