A Null Model for Pearson Coexpression Networks

Gene coexpression networks inferred by correlation from high-throughput profiling data, such as microarray data, are simple but effective structures for discovering and interpreting linear gene relationships. In recent years, several approaches have been proposed for deciding when the resulting correlation values are statistically significant. The question is most critical when the number of samples is small, because there is then a non-negligible chance that even high correlation values arise from random effects. Here we introduce a novel hard-thresholding solution based on the assumption that a coexpression network inferred from randomly generated data is expected to be empty. The threshold is derived analytically and, as a deterministic independent null model, it depends only on the dimensions of the starting data matrix, under assumptions on the skewness of the data distribution that are compatible with the structure of gene expression data. We show, on synthetic and array datasets, that the proposed threshold is effective in eliminating all false positive links, at the cost of additional false negatives, that is, true edges that go undetected.
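
The small-sample problem described above can be illustrated with a short simulation. The sketch below is not the analytic threshold of the paper: it is a Monte Carlo illustration, assuming independent Gaussian noise as the "randomly generated data", and the function names (`pearson`, `max_abs_null_correlation`) are hypothetical helpers introduced only for this example. It computes the largest absolute Pearson correlation among all gene pairs in a pure-noise matrix; any hard threshold at or below that value would leave spurious edges in the network.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_null_correlation(n_genes, n_samples, seed=0):
    """Largest |r| over all gene pairs in a matrix of independent noise.

    Assumption for illustration only: each gene profile is drawn from a
    standard Gaussian, independently of every other gene.
    """
    rng = random.Random(seed)
    data = [
        [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        for _ in range(n_genes)
    ]
    return max(
        abs(pearson(a, b)) for a, b in itertools.combinations(data, 2)
    )


if __name__ == "__main__":
    # Same number of genes, increasing number of samples.
    for n_samples in (5, 10, 20, 50):
        r_max = max_abs_null_correlation(n_genes=100, n_samples=n_samples)
        print(f"samples={n_samples:3d}  max |r| under the null = {r_max:.3f}")
```

Running the script for a fixed number of genes and a growing number of samples shows how the largest chance correlation shrinks as samples are added, which is why a threshold that keeps the random network empty has to depend on both dimensions of the data matrix.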
