Iterative integrated imputation for missing data and pathway models with applications to breast cancer subtypes

Tumor development is driven by complex combinations of biological elements. Recent advances suggest that molecularly distinct subtypes of breast cancers may respond differently to pathway-targeted therapies. Thus, it is important to dissect pathway disturbances by integrating multiple molecular profiles, such as genetic, genomic and epigenomic data. However, missing data are often present in the -omic profiles of interest. Motivated by genomic data integration and imputation, we present a new statistical framework for pathway significance analysis. Specifically, we develop a new strategy for imputation of missing data in large-scale genomic studies, which adapts low-rank, structured matrix completion. Our iterative strategy enables us to impute missing data in complex configurations across multiple data platforms. In turn, we perform large-scale pathway analysis integrating gene expression, copy number, and methylation data. The advantages of the proposed statistical framework are demonstrated through simulations and real applications to breast cancer subtypes. We demonstrate superior power to identify pathway disturbances, compared with other imputation strategies. We also identify differential pathway activity across different breast tumor subtypes.
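To make the idea of iterative low-rank imputation concrete, the following is a minimal sketch of a generic "fill, truncate, refill" scheme based on a truncated SVD. It is not the structured matrix completion procedure developed in the paper, and it does not handle the multi-platform missingness configurations described above. The function name `iterative_lowrank_impute`, the `rank`, `max_iter`, and `tol` parameters, and the synthetic samples-by-genes matrix are all illustrative assumptions.

```python
import numpy as np


def iterative_lowrank_impute(X, rank=2, max_iter=500, tol=1e-8):
    """Illustrative iterative SVD imputation (hypothetical helper, not the paper's method).

    X is a 2-D array with np.nan marking missing entries. Every column is
    assumed to contain at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initialise missing entries with the mean of the observed values in each column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(max_iter):
        # Rank-`rank` approximation of the current completed matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]

        # Keep observed entries fixed; refresh only the missing ones.
        updated = np.where(missing, low_rank, X)

        # Stop once the relative change between iterations is negligible.
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break

    return filled


if __name__ == "__main__":
    # Synthetic example: a rank-2 "samples x genes" matrix with about 10% of entries removed.
    rng = np.random.default_rng(0)
    truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 12))
    observed = truth.copy()
    mask = rng.random(truth.shape) < 0.10
    observed[mask] = np.nan

    completed = iterative_lowrank_impute(observed, rank=2)
    rmse = np.sqrt(np.mean((completed[mask] - truth[mask]) ** 2))
    print("RMSE on held-out entries:", rmse)
```

In this sketch the observed entries are never altered; only the missing positions are updated at each pass. The rank is fixed in advance here purely for simplicity, whereas in practice it would need to be chosen from the data.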
