Graph Clustering Via a Discrete Uncoupling Process

A discrete uncoupling process for finite spaces is introduced, called the Markov Cluster Process or the MCL process. The process is the engine for the graph clustering algorithm called the MCL algorithm. The MCL process takes a stochastic matrix as input and then alternates expansion and inflation, each step defining a stochastic matrix in terms of the previous one. Expansion corresponds to taking the $k$th power of a stochastic matrix, where $k\in\N$. Inflation corresponds to a parametrized operator $\Gamma_r$, $r\geq 0$, that maps the set of (column) stochastic matrices onto itself. The image $\Gamma_r M$ is obtained by raising each entry of $M$ to the $r$th power and rescaling each column so that it sums to 1 again. In practice the process converges very fast towards a limit that is invariant under both matrix multiplication and inflation, with quadratic convergence around the limit points. The heuristic behind the process is its expected behavior for (Markov) graphs possessing cluster structure. The process is typically applied to the matrix of random walks on a given graph $G$, and the connected components of (the graph associated with) the process limit generically allow a clustering interpretation of $G$. The limit is in general extremely sparse and iterands are sparse in a weighted sense, implying that the MCL algorithm is very fast and highly scalable.

Several mathematical properties of the MCL process are established. Most notably, the process (and algorithm) iterands possess structural properties generalizing the mapping from process limits onto clusterings. The inflation operator $\Gamma_r$ maps the class of matrices that are diagonally similar to a symmetric matrix onto itself. The phrase diagonally positive semi-definite (dpsd) is used for matrices that are diagonally similar to a positive semi-definite matrix. For $r\in\N$ and for $M$ a stochastic dpsd matrix, the image $\Gamma_r M$ is again dpsd. Determinantal inequalities satisfied by a dpsd matrix $M$ imply a natural ordering among the diagonal elements of $M$, generalizing the mapping of process limits onto clusterings. The spectrum of $\Gamma_{\infty} M$ is of the form $\{0^{n-k}, 1^k\}$, where $k$ is the number of endclasses of the ordering associated with $M$, and $n$ is the dimension of $M$. This attests to the uncoupling effect of the inflation operator.
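
The alternation of expansion and inflation described above can be illustrated with a minimal pure-Python sketch. It is not the author's implementation: the function names (`normalize_columns`, `expand`, `inflate`, `mcl_limit`, `clusters`, `adjacency_with_loops`), the stopping tolerance, the threshold `eps`, and the addition of self-loops to the input graph are all assumptions made for illustration. Clusters are read off as the weakly connected components of the graph associated with the (approximate) limit, which is one possible reading of the clustering interpretation mentioned in the text.

```python
def normalize_columns(m):
    """Rescale every column of a square nonnegative matrix to sum to 1.

    Assumes no column is entirely zero (self-loops guarantee this below).
    """
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m, k=2):
    """Expansion: the k-th matrix power of m, for an integer k >= 1."""
    n = len(m)
    result = m
    for _ in range(k - 1):
        result = [[sum(result[i][t] * m[t][j] for t in range(n))
                   for j in range(n)] for i in range(n)]
    return result


def inflate(m, r=2.0):
    """Inflation Gamma_r: raise each entry to the r-th power, then
    rescale each column so that it sums to 1 again."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl_limit(adjacency, k=2, r=2.0, max_iter=100, tol=1e-9):
    """Alternate expansion and inflation until successive iterands differ
    by less than tol (hypothetical stopping rule) or max_iter is reached."""
    m = normalize_columns(adjacency)  # column-stochastic random-walk matrix
    n = len(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m, k), r)
        diff = max(abs(nxt[i][j] - m[i][j])
                   for i in range(n) for j in range(n))
        m = nxt
        if diff < tol:
            break
    return m


def clusters(limit, eps=1e-6):
    """Weakly connected components of the graph whose edges are the
    entries of `limit` larger than eps (eps is an assumed threshold)."""
    n = len(limit)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if limit[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


def adjacency_with_loops(n, edges):
    """Undirected 0/1 adjacency matrix with a self-loop on every node
    (adding loops is an assumption of this sketch)."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    return a


if __name__ == "__main__":
    # Two triangles joined by the single edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    limit = mcl_limit(adjacency_with_loops(6, edges))
    print(clusters(limit))
```

Dense nested lists keep the sketch short and dependency-free; they do not exploit the sparsity of the iterands that the text identifies as the source of the algorithm's speed and scalability.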
