Scalable clustering of modern networks
暂无分享,去创建一个
Graphs (or networks) provide a simple yet powerful model for capturing the interactions or relationships between the entities in many different domains. This dissertation focuses on a fundamental analytical or data management primitive related to graphs - the discovery of natural groups or clusters from a graph, or graph clustering in short. Due to the representational versatility of graphs, this problem has varied applications - the discovery of protein complexes from protein interaction networks, data partitioning for distributed computing, community discovery in online social networks, general data clustering (via similarity-weighted graphs), image segmentation, optimizing transistor layout for VLSI design and so on.
The increasing scale, complexity and diversity of modern (graph) data has rendered classical solutions for this task, such as spectral clustering, either very slow, incomplete or inaccurate. I address this situation in this thesis by approaching the graph clustering problem from two directions - in one direction, I develop algorithms for the core clustering task; in the second direction, I build intelligent pre-processing strategies that transform the graph to enable fast and accurate clustering subsequently. Such a two-pronged strategy allows us to concentrate on the novel aspects of new variants of the clustering problem, while exploiting existing solutions where we can. In the direction of algorithms for the core clustering task, my contribution to the state-of-the-art is a novel algorithm, MLR-MCL, based on the multi-level simulation of stochastic flows. MLR-MCL is a significantly enhanced version of a popular existing algorithm called MCL, and is fast, accurate, noise-tolerant and allows easy adjustment of the balance in the output cluster sizes. In the direction of intelligent pre-processing strategies, I have made three important contributions: (i) Graph sparsification algorithms that clarify the cluster structure and thereby speed-up subsequent clustering, by up to 50x; (ii) Algorithms that symmetrize directed graphs into similarity-weighted, undirected graphs suitable for subsequent clustering; (iii) Novel hashing-based algorithms that efficiently convert general non-graph data into similarity-weighted graphs suitable for clustering.