HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

Abstract: Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks and protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to the increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches for clustering sequence similarity or expression networks. Despite its popularity, MCL's scalability remains a bottleneck for large datasets because of its high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ∼70 million nodes with ∼68 billion edges in ∼2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.
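For readers unfamiliar with the algorithm that HipMCL parallelizes, the following is a minimal serial sketch of the generic MCL iteration, which alternates expansion (squaring a column-stochastic matrix) with inflation (element-wise powering followed by column renormalization). It is not HipMCL's implementation: it uses dense Python lists rather than distributed sparse matrices, omits the pruning that practical MCL codes apply, and all function names, the added self-loops, and the default parameters (`inflation=2.0`, `tol`, `eps`) are assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                out[i][j] = m[i][j] / s
    return out


def expand(m):
    """Expansion step: square the matrix (two steps of a random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster a non-empty graph given as a dense adjacency matrix.

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(adjacency)
    # Assumption: add a self-loop to every node before normalizing.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:  # stop once the matrix no longer changes
            break
    # Each row with non-negligible entries marks one cluster: the columns
    # holding those entries are the cluster's members.
    clusters = {frozenset(j for j in range(n) if row[j] > eps) for row in m}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)
```

Calling `mcl` on a small adjacency matrix returns the node indices grouped by cluster. The expansion step is a matrix-matrix product, and in this dense form it costs cubic time and quadratic memory in the number of nodes, which gives a rough sense of why running time and memory limit MCL on large networks.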
