DECO: decompose heterogeneous population cohorts for patient stratification and discovery of sample biomarkers using omic data profiling

Abstract

Motivation: Patient and sample diversity is one of the main challenges when dealing with clinical cohorts in biomedical genomics studies. During the last decade, several methods have been developed to identify biomarkers assigned to specific individuals or subtypes of samples. However, current methods still fail to discover markers in complex scenarios where heterogeneity or hidden phenotypical factors are present. Here, we propose a method to analyze and understand heterogeneous data without resorting to classical normalization approaches that reduce or remove variation.

Results: DEcomposing heterogeneous Cohorts using Omic data profiling (DECO) is a method to find significant associations between biological features (biomarkers) and samples (individuals) by analyzing large-scale omic data. The method identifies and categorizes biomarkers of specific phenotypic conditions based on a recurrent differential analysis integrated with a non-symmetrical correspondence analysis. DECO combines both omic data dispersion and the predictor–response relationship from the non-symmetrical correspondence analysis into a single statistic (called the h-statistic), allowing the identification of closely related sample categories within complex cohorts. The performance is demonstrated using simulated data and five experimental transcriptomic datasets, and by comparison with seven other methods. We show that DECO greatly enhances the discovery and fine-grained identification of biomarkers, making it especially suited for deep and accurate patient stratification.

Availability and implementation: DECO is freely available as an R package (including a practical vignette) from the Bioconductor repository (http://bioconductor.org/packages/deco/).

Supplementary information: Supplementary data are available at Bioinformatics online.
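The idea behind a recurrent differential analysis can be illustrated with a minimal sketch. This is not the DECO implementation or its R API: the function names, the toy data, the Welch t-statistic and the cutoff are all assumptions chosen for illustration. The sketch compares every small subsample of one class against every small subsample of the other and records, per feature and per sample, how often the sample took part in a comparison that passed the cutoff. A count table of this kind (features by samples) is the sort of input a correspondence analysis could then work on.

```python
import itertools
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two lists of numbers (illustrative stand-in test)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def recurrent_differential_counts(data, group_a, group_b, subsample=3, t_cutoff=4.0):
    """Exhaustively compare subsamples of group_a against subsamples of group_b.

    data: dict mapping feature name -> list of values, one per sample.
    group_a, group_b: lists of sample indices for the two classes.
    Returns a dict mapping feature name -> list of per-sample counts of
    comparisons that involved the sample and passed the (hypothetical) cutoff.
    """
    counts = {feature: [0] * len(values) for feature, values in data.items()}
    for sub_a in itertools.combinations(group_a, subsample):
        for sub_b in itertools.combinations(group_b, subsample):
            for feature, values in data.items():
                t = welch_t([values[i] for i in sub_a], [values[i] for i in sub_b])
                if abs(t) >= t_cutoff:
                    for i in sub_a + sub_b:
                        counts[feature][i] += 1
    return counts


# Toy cohort (invented values): samples 0-5 are controls, 6-11 are cases.
# "g_sub" is elevated only in a hidden subtype of the cases (samples 6-8);
# "g_flat" shows no difference between the classes.
baseline = [5.0, 5.1, 4.9, 5.0, 5.1, 4.9]
data = {
    "g_sub": baseline + [9.0, 9.1, 8.9, 5.0, 5.1, 4.9],
    "g_flat": baseline + baseline,
}
controls = list(range(0, 6))
cases = list(range(6, 12))

counts = recurrent_differential_counts(data, controls, cases)
for feature, per_sample in counts.items():
    print(feature, per_sample)
```

In this toy cohort only the comparisons that draw all three hidden-subtype cases pass the cutoff for "g_sub", so the counts concentrate on those samples, while a single test over the whole case class would dilute the signal with the three unaffected cases.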
