PINCAGE: probabilistic integration of cancer genomics data for perturbed gene identification and sample classification

MOTIVATION Cancer development and progression are driven by a complex pattern of genomic and epigenomic perturbations. Both types of perturbation can affect gene expression levels and disease outcome. Integrative analysis of cancer genomics data may therefore improve detection of perturbed genes and prediction of disease state. Because different data types are usually dependent, analysis based on independence assumptions makes inefficient use of the data and can lead to false conclusions.

MODEL Here, we present PINCAGE (Probabilistic INtegration of CAncer GEnomics data), a method that probabilistically integrates cancer genomics data to jointly evaluate RNA-seq gene expression and 450k array DNA methylation measurements of promoters and gene bodies. It models the dependence between expression and methylation using modular graphical models, which also allow additional data types to be included in the future.

RESULTS We apply our approach to a Breast Invasive Carcinoma dataset from The Cancer Genome Atlas consortium, which includes 82 adjacent normal and 730 cancer samples. We identify new biomarker candidates of breast cancer development (PTF1A, RABIF, RAG1AP1, TIMM17A, LOC148145) and progression (SERPINE3, ZNF706). Compared with established methods that assume independence between the tested data types, PINCAGE discriminates better between normal and tumour tissue and between progressing and non-progressing tumours, especially when using evidence from multiple genes. Our method can be applied to any type of cancer or, more generally, to any genomic disease for which a sufficient amount of molecular data is available.

AVAILABILITY AND IMPLEMENTATION R scripts are available at http://moma.ki.au.dk/prj/pincage/

CONTACT michal.switnicki@clin.au.dk or jakob.skou@clin.au.dk

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
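The point made under MOTIVATION and MODEL, that ignoring the dependence between expression and methylation wastes information, can be illustrated with a toy calculation. The sketch below is not the PINCAGE model: it assumes a simple bivariate Gaussian per class, and all parameter values and function names are hypothetical. It compares the per-gene log-likelihood ratio (tumour versus normal) computed with and without the correlation term, for a case where the two classes differ only in how tightly methylation and expression are coupled.

```python
import math


def bivariate_normal_logpdf(x, y, mean_x, mean_y, sd_x, sd_y, rho):
    """Log-density of a bivariate Gaussian with correlation rho."""
    zx = (x - mean_x) / sd_x
    zy = (y - mean_y) / sd_y
    one_minus_r2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_r2
    log_norm = math.log(2.0 * math.pi * sd_x * sd_y) + 0.5 * math.log(one_minus_r2)
    return -0.5 * quad - log_norm


def log_likelihood_ratio(sample, tumour, normal, use_dependence=True):
    """log P(sample | tumour) - log P(sample | normal) for one gene.

    With use_dependence=False the correlation is forced to zero, which
    mimics a method that treats the two data types as independent.
    """
    expr, meth = sample

    def loglik(params):
        rho = params["rho"] if use_dependence else 0.0
        return bivariate_normal_logpdf(
            expr, meth,
            params["mean_expr"], params["mean_meth"],
            params["sd_expr"], params["sd_meth"], rho,
        )

    return loglik(tumour) - loglik(normal)


# Hypothetical class models (standardised units). The marginals are identical;
# only the expression-methylation coupling differs between the classes.
normal = {"mean_expr": 0.0, "mean_meth": 0.0, "sd_expr": 1.0, "sd_meth": 1.0, "rho": -0.8}
tumour = {"mean_expr": 0.0, "mean_meth": 0.0, "sd_expr": 1.0, "sd_meth": 1.0, "rho": 0.0}

# A sample with high expression despite high methylation.
sample = (1.0, 1.0)

joint_llr = log_likelihood_ratio(sample, tumour, normal, use_dependence=True)
indep_llr = log_likelihood_ratio(sample, tumour, normal, use_dependence=False)

print(f"joint model LLR:       {joint_llr:.3f}")
print(f"independent model LLR: {indep_llr:.3f}")
```

In this toy setting the independent model sees no difference between the classes, while the joint model assigns clear evidence towards the tumour class. Under the additional assumption that genes are conditionally independent given the class, evidence from several genes could be combined by summing their per-gene log-likelihood ratios.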
