A Statistical Framework for Incorporating Multi-Omics Technologies for Precision Medicine

This thesis addresses important statistical challenges in precision medicine, the clinical practice of customising treatment plans for individual patients using genetic information. We propose methods, frameworks and procedures that tackle the discovery, translation and implementation of precision medicine through the use of omics data. Specifically, we examine three key aspects of precision medicine: (1) the use of targeted assays as opposed to whole-transcriptome technologies such as microarrays and bulk RNA-sequencing, with a focus on the impact of gene-set selection bias on classical gene-set tests in targeted assays; (2) the challenges of stable feature selection when dealing with medium-throughput targeted assays; and (3) the development of a prediction model on noisy and batched gene expression data from multiple sources that can be implemented under clinical constraints.

In Chapter 1, we provide a brief overview of precision medicine with a special focus on targeted assays as a potential instrument for large-scale implementation. We present an overview of the statistical challenges that must be overcome in three phases of precision medicine in order to facilitate the final implementation.

Chapter 2 focuses on adapting the classical gene-set tests for targeted assays by proposing a new method, bcGST (bias-corrected Gene Set Test). We design synthetic simulations using eleven publicly available whole-genome expression datasets and show that bcGST improves on analyses in which the gene-set selection bias is ignored. Most of this chapter was published in Wang et al. (2019a).

In Chapter 3, we propose a novel variable selection method, APES (APproximated Exhaustive Search), for generalised linear models (GLMs). APES modernises the classical exhaustive variable selection problem for GLMs, enabling model selection with hundreds of variables. It is therefore capable of analysing data from commercially available targeted assays. The advantage of APES lies in its ability to approximate a genuine exhaustive search at a dramatically improved speed. We devise a comprehensive set of simulations to test the performance of APES and apply it to data from a real targeted assay. Most of this chapter was published in Wang et al. (2019b).

Chapter 4 is motivated by irreproducible research results caused by inconsistencies in the scaling of omics data and in the corresponding tuning of parameters in prediction models. We first define a notion of “transferability” to describe a model’s ability to maintain high predictive power across multiple omics datasets without any additional manipulation of the model. We then propose the Cross-Platform Omics Prediction (CPOP) procedure, which constructs transferable models. These models have biologically relevant features that are statistically stable with respect to between-data noise. CPOP is specifically designed for prediction across different omics platforms and for the prospective experiments that are commonly seen in clinical settings. We curate four melanoma datasets and a prospective targeted assay experiment to illustrate the novelty of CPOP.

This thesis concludes with some final remarks in Chapter 5. In summary, this thesis contributes to precision medicine research by developing relevant, interpretable and implementable statistical methods.
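
As a rough illustration of the classical exhaustive variable selection problem that Chapter 3 sets out to approximate, the sketch below fits every non-empty subset of predictors and keeps the one with the lowest BIC. It is not the APES algorithm: it assumes a Gaussian linear model rather than a general GLM, and the function name `exhaustive_best_subset`, the simulated data and the choice of BIC are all hypothetical choices made for this example.

```python
import itertools

import numpy as np


def exhaustive_best_subset(X, y):
    """Fit least squares on every non-empty column subset; return the lowest-BIC one."""
    n, p = X.shape
    best_bic, best_subset, n_models = np.inf, (), 0
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept plus the chosen columns.
            design = np.column_stack([np.ones(n), X[:, list(subset)]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = float(np.sum((y - design @ coef) ** 2))
            # Gaussian BIC up to an additive constant (k slopes + 1 intercept).
            bic = n * np.log(rss / n) + (k + 1) * np.log(n)
            n_models += 1
            if bic < best_bic:
                best_bic, best_subset = bic, subset
    return best_subset, best_bic, n_models


# Simulated data (an assumption for illustration): only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)

subset, bic, n_models = exhaustive_best_subset(X, y)
print("selected columns:", subset)
print("models fitted:", n_models)  # 2**8 - 1 = 255
```

With p predictors the loop visits 2^p - 1 models, so the count doubles with every added variable. This is why a genuine exhaustive search becomes impractical at the hundreds of variables typical of targeted assays.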

[1]  Insuk Sohn,et al.  Statistical Issues in the Design and Analysis of nCounter Projects , 2014, Cancer informatics.

[2]  Casey S. Greene,et al.  Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously , 2017, bioRxiv.

[3]  Bin Yu,et al.  Estimation Stability With Cross-Validation (ESCV) , 2013, 1303.3128.

[4]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[5]  Francis R. Bach,et al.  Bolasso: model consistent Lasso estimation through the bootstrap , 2008, ICML '08.

[6]  H. Moch,et al.  Comparison of EndoPredict and Oncotype DX Test Results in Hormone Receptor Positive Invasive Breast Cancer , 2013, PloS one.

[7]  C. Garbe,et al.  The mitogen-activated protein kinase pathway in melanoma part I - Activation and primary resistance mechanisms to BRAF inhibition. , 2017, European journal of cancer.

[8]  Alan R. Moody,et al.  From Big Data to Precision Medicine , 2019, Front. Med..

[9]  Qing-Rong Chen,et al.  An integrated cross-platform prognosis study on neuroblastoma patients. , 2008, Genomics.

[10]  R. Tibshirani,et al.  Strong rules for discarding predictors in lasso‐type problems , 2010, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[11]  Vivek Jayaswal,et al.  Disturbed protein–protein interaction networks in metastatic melanoma are associated with worse prognosis and increased functional mutation burden , 2013, Pigment cell & melanoma research.

[12]  K Murray,et al.  Graphical tools for model selection in generalized linear models , 2013, Statistics in medicine.

[13]  Stefan Michiels,et al.  Prediction of cancer outcome with microarrays: a multiple random validation strategy , 2005, The Lancet.

[14]  R. Tibshirani,et al.  Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso , 2017, 1707.08692.

[15]  Joshua M. Stuart,et al.  The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[16]  Robert Haas,et al.  Designing and interpreting ‘multi-omic’ experiments that may change our understanding of biology , 2017, Current opinion in systems biology.

[17]  Christina Backes,et al.  GeneTrail—advanced gene set enrichment analysis , 2007, Nucleic Acids Res..

[18]  Jie Tan,et al.  Cross-platform normalization of microarray and RNA-seq data for machine learning applications , 2016, PeerJ.

[19]  Peng Zhao,et al.  On Model Selection Consistency of Lasso , 2006, J. Mach. Learn. Res..

[20]  Ulrich Mansmann,et al.  A 29-gene and cytogenetic score for the prediction of resistance to induction treatment in acute myeloid leukemia , 2017, Haematologica.

[21]  Eric R. Ziegel,et al.  Generalized Linear Models , 2002, Technometrics.

[22]  Torsten Hothorn,et al.  On the Exact Distribution of Maximally Selected Rank Statistics , 2002, Comput. Stat. Data Anal..

[23]  Helga Thorvaldsdóttir,et al.  Molecular signatures database (MSigDB) 3.0 , 2011, Bioinform..

[24]  F. Offner,et al.  MammaPrint versus EndoPredict: Poor correlation in disease recurrence risk classification of hormone receptor positive breast cancer , 2017, PloS one.

[25]  Israel Steinfeld,et al.  BMC Bioinformatics BioMed Central , 2008 .

[26]  J. Kirkwood,et al.  Targeting the MAPK pathway in advanced BRAF wild-type melanoma. , 2019, Annals of oncology : official journal of the European Society for Medical Oncology.

[27]  David R. Anderson,et al.  Model Selection and Multimodel Inference , 2003 .

[28]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[29]  Charity W. Law,et al.  voom: precision weights unlock linear model analysis tools for RNA-seq read counts , 2014, Genome Biology.

[30]  G. Mann,et al.  Distinct Molecular Profiles and Immunotherapy Treatment Outcomes of V600E and V600K BRAF-Mutant Melanoma , 2019, Clinical Cancer Research.

[31]  J. Ajani,et al.  Development and Validation of a Six-Gene Recurrence Risk Score Assay for Gastric Cancer , 2016, Clinical Cancer Research.

[32]  Christina Gloeckner,et al.  Modern Applied Statistics With S , 2003 .

[33]  M. Pencina,et al.  General Cardiovascular Risk Profile for Use in Primary Care: The Framingham Heart Study , 2008, Circulation.

[34]  Jiang Li,et al.  Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data , 2013, PloS one.

[35]  C. Berking,et al.  A nine-gene signature predicting clinical outcome in cutaneous melanoma , 2013, Journal of Cancer Research and Clinical Oncology.

[36]  David G. Robinson,et al.  A nested parallel experiment demonstrates differences in intensity-dependence between RNA-seq and microarrays , 2014, bioRxiv.

[37]  Hongzhe Li,et al.  Variable selection in regression with compositional covariates , 2014 .

[38]  Di Wu,et al.  ROAST: rotation gene set tests for complex microarray experiments , 2010, Bioinform..

[39]  Robert Tibshirani,et al.  Log‐ratio lasso: Scalable, sparse estimation for log‐ratio models , 2017, Biometrics.

[40]  Adrian Alexa,et al.  Gene set enrichment analysis with topGO , 2006 .

[41]  Chen Lin,et al.  MCVIS: A New Framework for Collinearity Discovery, Diagnostic, and Visualization , 2020 .

[42]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[43]  Guoshuai Cai,et al.  Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data , 2018, Bioinform..

[44]  Rainer Spang,et al.  Molecular signatures that can be transferred across different omics platforms , 2017, Bioinform..

[45]  Paul C Boutros,et al.  Systematic evaluation of medium-throughput mRNA abundance platforms. , 2013, RNA.

[46]  Alan Sharpe,et al.  High-Frequency Targetable EGFR Mutations in Sinonasal Squamous Cell Carcinomas Arising from Inverted Sinonasal Papilloma. , 2015, Cancer research.

[47]  I. Nookaew,et al.  A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae , 2012, Nucleic acids research.

[48]  Matthew Wongchenko,et al.  bcGST - an interactive bias-correction method to identify over-represented gene-sets in boutique arrays , 2017, bioRxiv.

[49]  G. Claeskens Statistical Model Choice , 2016 .

[50]  Dario Strbenac,et al.  Melanoma Explorer: a web application to allow easy reanalysis of publicly available and clinically annotated melanoma omics data sets , 2019, Melanoma research.

[51]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[52]  H. Akaike A new look at the statistical model identification , 1974 .

[53]  Terence P Speed,et al.  A new normalization for Nanostring nCounter gene expression data , 2019, Nucleic acids research.

[54]  M. J. van de Vijver,et al.  A gene-expression signature as a predictor of survival in breast cancer. , 2002, The New England journal of medicine.

[55]  D. Bertsimas,et al.  Best Subset Selection via a Modern Optimization Lens , 2015, 1507.03133.

[56]  Dimitris Bertsimas,et al.  Logistic Regression: From Art to Science , 2017 .

[57]  Donna K. Slonim,et al.  Getting Started in Gene Expression Microarray Analysis , 2009, PLoS Comput. Biol..

[58]  Jean Y. H. Yang,et al.  Fast and approximate exhaustive variable selection for generalised linear models with APES , 2019, Australian & New Zealand Journal of Statistics.

[59]  Rehman Qureshi,et al.  A Gene Expression Classifier from Whole Blood Distinguishes Benign from Malignant Lung Nodules Detected by Low-Dose CT. , 2018, Cancer research.

[60]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[61]  Faramarz Valafar,et al.  Empirical comparison of cross-platform normalization methods for gene expression data , 2011, BMC Bioinformatics.

[62]  S. Müller,et al.  On Model Selection Curves , 2010 .

[63]  Qing-zuo Liu,et al.  Risk score based on three mRNA expression predicts the survival of bladder cancer. , 2017, Oncotarget.

[64]  Hussein Hazimeh,et al.  Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms , 2018, Oper. Res..

[65]  Nicolas Servant,et al.  A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis , 2013, Briefings Bioinform..

[66]  Y. Fujii,et al.  Gene expression signature‐based prognostic risk score in patients with glioblastoma , 2013, Cancer science.

[67]  Fuqiang Pan,et al.  Distinct prognostic value of mRNA expression of guanylate-binding protein genes in skin cutaneous melanoma , 2018, Oncology letters.

[68]  Christiani A. Amorim,et al.  A Draft Map of the Human Ovarian Proteome for Tissue Engineering and Clinical Applications , 2018, Molecular & Cellular Proteomics.

[69]  Jeremy J. W. Chen,et al.  A five-gene signature and clinical outcome in non-small-cell lung cancer. , 2007, The New England journal of medicine.

[70]  Emmanuel J. Candès,et al.  False Discoveries Occur Early on the Lasso Path , 2015, ArXiv.

[71]  Michael L. Littman,et al.  Bayesian Adaptive Sampling for Variable Selection and Model Averaging , 2011 .

[72]  Jean-Yves Audibert,et al.  Robust linear least squares regression , 2010, 1010.0074.

[73]  Shila Ghazanfar,et al.  scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets , 2019, Proceedings of the National Academy of Sciences.

[74]  Dong-Hyung Cho,et al.  A nineteen gene‐based risk score classifier predicts prognosis of colorectal cancer patients , 2014, Molecular oncology.

[75]  C. Creighton,et al.  MAPK4 overexpression promotes tumor progression via noncanonical activation of AKT/mTOR signaling , 2019, The Journal of clinical investigation.

[76]  F. Baehner,et al.  A Prospective Comparison of the 21-Gene Recurrence Score and the PAM50-Based Prosigna in Estrogen Receptor-Positive Early-Stage Breast Cancer , 2015, Advances in Therapy.

[77]  Robert W. Wilson,et al.  Regressions by Leaps and Bounds , 2000, Technometrics.

[78]  J. Shao AN ASYMPTOTIC THEORY FOR LINEAR MODEL SELECTION , 1997 .

[79]  David W. Hosmer,et al.  Applied Logistic Regression , 1991 .

[80]  Dario Strbenac,et al.  ClassifyR: an R package for performance assessment of classification with applications to transcriptomics , 2015, Bioinform..

[81]  B. Di Camillo,et al.  Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis. , 2015, Briefings in functional genomics.

[82]  Samuel Müller,et al.  Determination of prognosis in metastatic melanoma through integration of clinico‐pathologic, mutation, mRNA, microRNA, and protein information , 2015, International journal of cancer.

[83]  A. Berjón,et al.  Comparison of risk classification between EndoPredict and MammaPrint in ER-positive/HER2-negative primary invasive breast cancer , 2017, PloS one.

[84]  Jill P Mesirov,et al.  Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration , 2013, BMC Medicine.

[85]  A. Witteveen,et al.  Converting a breast cancer microarray signature into a high-throughput diagnostic test , 2006, BMC Genomics.

[86]  Hailiang Huang,et al.  Characterization of candidate genes in inflammatory bowel disease-associated risk loci. , 2016, JCI insight.

[87]  Tianbao Yang,et al.  Efficient Feature Screening for Lasso-Type Problems via Hybrid Safe-Strong Rules , 2017, 1704.08742.

[88]  D. Levy,et al.  Prediction of coronary heart disease using risk factor categories. , 1998, Circulation.

[89]  Rainer Spang,et al.  Reference point insensitive molecular data analysis , 2017, Bioinform..

[90]  B. Klein,et al.  Identification of a 20-Gene Expression-Based Risk Score as a Predictor of Clinical Outcome in Chronic Lymphocytic Leukemia Patients , 2014, BioMed research international.

[91]  Rafael A Irizarry,et al.  Gene set enrichment analysis made simple , 2009, Statistical methods in medical research.

[92]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[93]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[94]  Johan Staaf,et al.  Molecular stratification of metastatic melanoma using gene expression profiling : Prediction of survival outcome and benefit from molecular targeted therapy , 2015, Oncotarget.

[95]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[96]  L. Cardon,et al.  Precision medicine, genomics and drug discovery. , 2016, Human molecular genetics.

[97]  Arianne C Richard,et al.  Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation , 2014, BMC Genomics.

[98]  P. Gonzalez-Alegre,et al.  Towards precision medicine , 2017 .

[99]  Matthew E. Ritchie,et al.  limma powers differential expression analyses for RNA-sequencing and microarray studies , 2015, Nucleic acids research.

[100]  S. Müller,et al.  Model Selection in Linear Mixed Models , 2013, 1306.2427.

[101]  L. V. van't Veer,et al.  Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. , 2006, Journal of the National Cancer Institute.

[102]  C. Wells,et al.  YuGene: a simple approach to scale gene expression data derived from different platforms for integrated analyses. , 2014, Genomics.

[103]  Gaorong Li,et al.  Greedy forward regression for variable screening , 2015, 1511.01124.

[104]  Trevor Hastie,et al.  Statistical Learning with Sparsity: The Lasso and Generalizations , 2015 .

[105]  David W. Hosmer,et al.  Best subsets logistic regression , 1989 .

[106]  Alan Welsh,et al.  mplot: An R Package for Graphical Model Stability and Variable Selection Procedures , 2015, 1509.07583.

[107]  F. Liu,et al.  A gene expression-based risk model reveals prognosis of gastric cancer , 2018, PeerJ.

[108]  X. Chen,et al.  Comparison of Nanostring nCounter® Data on FFPE Colon Cancer Samples and Affymetrix Microarray Data on Matched Frozen Tissues , 2016, PloS one.

[109]  Alan J. Miller,et al.  leaps: Regression Subset Selection. , 2004 .

[110]  Georg Heinze,et al.  Variable selection – A review and recommendations for the practicing statistician , 2018, Biometrical journal. Biometrische Zeitschrift.

[111]  Wayne Xu,et al.  Gene Expression Detection Assay for Cancer Clinical Use , 2018, Journal of Cancer.

[112]  Gordon K Smyth,et al.  Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments , 2011, Statistical Applications in Genetics and Molecular Biology.

[113]  Yuhong Yang Can the Strengths of AIC and BIC Be Shared , 2005 .

[114]  Sijian Wang,et al.  RANDOM LASSO. , 2011, The annals of applied statistics.

[115]  D. Firth Bias reduction of maximum likelihood estimates , 1993 .

[116]  Wei Shi,et al.  Optimizing the noise versus bias trade-off for Illumina whole genome expression BeadChips , 2010, Nucleic acids research.

[117]  Samuel Müller,et al.  Outlier Robust Model Selection in Linear Regression , 2005 .

[118]  M. Radmacher,et al.  Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. , 2003, Journal of the National Cancer Institute.

[119]  N. Meinshausen,et al.  Stability selection , 2008, 0809.2932.

[120]  B. Klein,et al.  Gene expression-based risk score in diffuse large B-cell lymphoma , 2012, Oncotarget.