Fast Parallel Construction of Correlation Similarity Matrices for Gene Co-Expression Networks on Multicore Clusters

Abstract: Gene co-expression networks are attracting growing attention as useful representations of biologically interesting interactions among genes. The most computationally demanding step in generating these networks is the construction of the correlation similarity matrix, because all pairwise combinations of genes must be analyzed and the complexity therefore increases quadratically with the number of genes. In this paper we present MPICorMat, a hybrid MPI/OpenMP parallel approach to constructing similarity matrices based on Pearson’s correlation. It is based on a previous tool (RMTGeneNet) that has been used in several biological studies and has proved accurate. Our tool obtains the same results as RMTGeneNet but significantly reduces runtime on multicore clusters. For instance, MPICorMat generates the correlation matrix of a dataset with 61,170 genes and 160 samples in less than one minute using 16 nodes with two Intel Xeon Sandy-Bridge processors each (256 cores in total), while the original tool needed almost 4.5 hours. The tool is also compared with another available approach for constructing correlation matrices on multicore clusters and shows better scalability and performance. MPICorMat is open-source software and is publicly available at https://sourceforge.net/projects/mpicormat/ .
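To make the quadratic cost concrete: a dataset with n genes has n(n-1)/2 distinct gene pairs, which for the 61,170 genes mentioned above is roughly 1.87 billion Pearson correlations. The following is a minimal serial sketch in Python of how such a matrix can be built and how the pairs might be divided among workers. It is not MPICorMat's implementation. The function names (`standardize`, `split_pairs`, `correlation_matrix`), the round-robin distribution, and the choice to report 0 for constant expression profiles are all assumptions made for illustration.

```python
import math


def standardize(row):
    """Center a profile and scale it to unit Euclidean norm."""
    n = len(row)
    mean = sum(row) / n
    centered = [x - mean for x in row]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        # Constant profile: Pearson's r is undefined; this sketch reports 0.
        return [0.0] * n
    return [c / norm for c in centered]


def pair_indices(n_genes):
    """Yield every unordered gene pair (i, j) with i < j."""
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            yield i, j


def split_pairs(n_genes, n_workers):
    """Assign pairs to workers round-robin (hypothetical distribution)."""
    chunks = [[] for _ in range(n_workers)]
    for k, pair in enumerate(pair_indices(n_genes)):
        chunks[k % n_workers].append(pair)
    return chunks


def correlation_matrix(expr, n_workers=4):
    """Build the symmetric Pearson correlation matrix of the rows of expr."""
    n = len(expr)
    z = [standardize(row) for row in expr]
    corr = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Each chunk is independent, so in a parallel setting it could be
    # handled by a separate process or thread; here the chunks run in turn.
    for chunk in split_pairs(n, n_workers):
        for i, j in chunk:
            # After standardization, Pearson's r is a plain dot product.
            r = sum(a * b for a, b in zip(z[i], z[j]))
            corr[i][j] = corr[j][i] = r
    return corr
```

Standardizing each profile once means every pair costs a single dot product over the samples, and because the matrix is symmetric only the upper triangle needs to be computed. The chunks share no state, which is what makes this workload a natural fit for distribution across processes and threads.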
