A mixture model to detect edges in sparse co-expression graphs with an application for comparing breast cancer subtypes.

We develop a method to recover a gene network's structure from co-expression data, measured in terms of normalized Pearson's correlation coefficients between gene pairs. We treat these co-expression measurements as edge weights in the complete graph whose nodes correspond to genes. To decide which edges exist in the gene network, we fit a three-component mixture model in which the observed weights of 'null edges' follow a normal distribution with mean 0, and the weights of non-null edges follow a mixture of two lognormal distributions, one for positively correlated and one for negatively correlated pairs. We show that this so-called L2N mixture model outperforms other methods in terms of power to detect edges, and that it allows control of the false discovery rate. Importantly, our method makes no assumptions about the true network structure. We demonstrate our method, which is implemented in an R package called edgefinder, using a large dataset consisting of expression values of 12,750 genes obtained from 1,616 women. We infer the gene network structure by cancer subtype and find insightful subtype characteristics. For example, we find thirteen pathways that are enriched in each of the cancer groups but not in the Normal group; two of these pathways are associated with autoimmune diseases and two others with graft rejection. We also find characteristics specific to individual breast cancer subtypes. For example, the Luminal A network includes a single, highly connected cluster of genes that is enriched in the human diseases category, and the Her2 subtype network contains a distinct, highly interconnected cluster that is uniquely enriched in drug metabolism pathways.
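The three-component mixture described above can be illustrated with a minimal Python sketch. It is not the edgefinder implementation: the function names, the parameterization, and the numeric values below are assumptions chosen for illustration, and the negative component is assumed to be a lognormal density on the absolute value of the weight. Given a normalized weight, the sketch evaluates the mixture and returns the posterior probability that the edge is null, which is the kind of quantity a false discovery rate rule could threshold.

```python
import math


def normal_pdf(x, sigma):
    """Density of N(0, sigma^2) at x (the assumed null-edge component)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def lognormal_pdf(x, mu, s):
    """Density of a lognormal with log-mean mu and log-sd s; zero for x <= 0."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / s
    return math.exp(-0.5 * z * z) / (x * s * math.sqrt(2 * math.pi))


def posterior_null(w, p_pos, p_neg, sigma, mu_pos, s_pos, mu_neg, s_neg):
    """Posterior probability that an edge with weight w is a null edge.

    p_pos and p_neg are the (hypothetical) mixing proportions of the
    positively and negatively correlated components; the null proportion
    is whatever remains.
    """
    p_null = 1.0 - p_pos - p_neg
    f_null = p_null * normal_pdf(w, sigma)
    f_pos = p_pos * lognormal_pdf(w, mu_pos, s_pos)    # support: w > 0
    f_neg = p_neg * lognormal_pdf(-w, mu_neg, s_neg)   # support: w < 0
    return f_null / (f_null + f_pos + f_neg)


# Hypothetical parameter values for a sparse network (98% null edges).
PARAMS = dict(p_pos=0.01, p_neg=0.01, sigma=1.0,
              mu_pos=math.log(5.0), s_pos=0.3,
              mu_neg=math.log(5.0), s_neg=0.3)

if __name__ == "__main__":
    for w in (-5.0, 0.5, 5.0):
        print(w, posterior_null(w, **PARAMS))
```

In an actual fit the proportions and component parameters would be estimated from the data, for example with an EM algorithm, rather than fixed by hand as they are here.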
