A modified two-stage Markov clustering algorithm for large and sparse networks

BACKGROUND: Graph-based hierarchical clustering algorithms become prohibitively costly in both execution time and storage space as the number of nodes approaches the millions.

OBJECTIVE: A fast and highly memory-efficient Markov clustering algorithm is proposed to classify huge sparse networks on an ordinary personal computer.

METHODS: Improvements over previous versions are achieved through carefully chosen data structures that facilitate the efficient handling of symmetric sparse matrices. Clustering is performed in two stages: the initial connected network is processed in a sparse matrix until it breaks into isolated, small, and relatively dense subgraphs, which are then processed separately until convergence is obtained. An intelligent stopping criterion is also proposed to halt further processing of a subgraph that tends toward completeness with equal edge weights. The main advantage of this algorithm is that the necessary number of iterations is decided separately for each graph node.

RESULTS: The proposed algorithm was tested using the SCOP95 and large synthetic protein sequence data sets. The validation process revealed that the proposed method can reduce the processing time of huge sequence networks by a factor of 3 to 6 compared to previous Markov clustering solutions, without any loss in partition quality.

CONCLUSIONS: A one-million-node and one-billion-edge protein sequence network defined by a BLAST similarity matrix can be processed on a high-end personal computer in 100 minutes. Further improvement in speed is possible via parallel data processing, while the extension toward several million nodes needs intermediate data storage, for example on solid state drives.
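The two-stage idea described in the methods can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain dictionary-of-columns sparse matrix instead of their specialized symmetric data structures, and it does not reproduce the stopping criterion for near-complete subgraphs. The names `two_stage_mcl`, `mcl_step`, and `components`, the `max_block` threshold that ends stage one, the pruning threshold, and the self-loop weighting are all assumptions made for illustration.

```python
from collections import defaultdict

def normalize(cols):
    """Scale every column so that its entries sum to 1 (column-stochastic)."""
    return {j: {i: v / sum(col.values()) for i, v in col.items()}
            for j, col in cols.items()}

def mcl_step(cols, inflation=2.0, prune=1e-4):
    """One Markov clustering iteration: expansion, inflation, pruning."""
    new = {}
    for j, col in cols.items():
        acc = defaultdict(float)
        for k, w in col.items():            # column j of M @ M
            for i, v in cols[k].items():
                acc[i] += v * w
        inflated = {i: v ** inflation for i, v in acc.items()}
        total = sum(inflated.values())
        kept = {i: v for i, v in inflated.items() if v / total >= prune}
        if not kept:                        # never leave a column empty
            best = max(inflated, key=inflated.get)
            kept = {best: inflated[best]}
        new[j] = kept
    return normalize(new)

def components(cols):
    """Connected components of the nonzero pattern (union-find)."""
    parent = {j: j for j in cols}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for j, col in cols.items():
        for i in col:
            parent[find(i)] = find(j)
    groups = defaultdict(set)
    for j in cols:
        groups[find(j)].add(j)
    return list(groups.values())

def two_stage_mcl(edges, inflation=2.0, max_block=3, max_iter=100, tol=1e-9):
    """edges: iterable of (u, v, weight) for an undirected graph, no self-loops."""
    cols = defaultdict(dict)
    for a, b, w in edges:
        cols[a][b] = w
        cols[b][a] = w
    for node in list(cols):                 # assumed self-loop: strongest edge
        cols[node][node] = max(cols[node].values())
    cols = normalize(cols)

    # Stage 1: iterate on the whole matrix until it falls apart into small blocks.
    for _ in range(max_iter):
        if max(len(c) for c in components(cols)) <= max_block:
            break
        cols = mcl_step(cols, inflation)

    # Stage 2: run each isolated block to convergence on its own.
    clusters = []
    for block in components(cols):
        sub = {j: dict(cols[j]) for j in block}
        for _ in range(max_iter):
            nxt = mcl_step(sub, inflation)
            delta = max(abs(nxt[j].get(i, 0.0) - sub[j].get(i, 0.0))
                        for j in block for i in set(nxt[j]) | set(sub[j]))
            sub = nxt
            if delta < tol:
                break
        clusters.extend(components(sub))
    return clusters

# Two triangles joined by one weak edge (toy data, not from the paper).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 0.1)]
clusters = two_stage_mcl(edges)
print(sorted(sorted(c) for c in clusters))
```

Because each block in stage two has its own convergence loop, a block that settles early stops consuming iterations, which is the sense in which the iteration count is decided locally rather than for the whole network.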
