Distinguishing prognostic and predictive biomarkers: an information theoretic approach

Abstract

Motivation: The identification of biomarkers to support decision-making is central to personalized medicine, in both clinical and research scenarios. The challenge can be seen in two halves: identifying predictive markers, which guide the development and use of tailored therapies; and identifying prognostic markers, which guide other aspects of care and clinical trial planning (i.e. prognostic markers can be considered as covariates for stratification). Mistakenly assuming a biomarker to be predictive when it is in fact largely prognostic (or vice versa) is highly undesirable, and can have financial, ethical and personal consequences. We present a framework for data-driven ranking of biomarkers by their prognostic and predictive strength, using a novel information theoretic method. This approach provides a natural algebra for discussing and quantifying the individual predictive and prognostic strength of each biomarker within a self-consistent mathematical framework.

Results: Our contribution is a novel procedure, INFO+, which naturally distinguishes the prognostic versus predictive role of each biomarker and handles higher order interactions. In a comprehensive empirical evaluation, INFO+ outperforms more complex methods, most notably when noise factors dominate and biomarkers are likely to be falsely identified as predictive when in fact they are just strongly prognostic. Furthermore, we show that our methods can be 1–3 orders of magnitude faster than competitors, making them useful for biomarker discovery in ‘big data’ scenarios. Finally, we apply our methods to identify predictive biomarkers in two real clinical trials, and introduce a new graphical representation that provides greater insight into the prognostic and predictive strength of each biomarker.

Availability and implementation: R implementations of the suggested methods are available at https://github.com/sechidis.

Supplementary information: Supplementary data are available at Bioinformatics online.
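The abstract does not spell out the scores behind the ranking, so the following is only a minimal sketch of the general idea, not the INFO+ procedure itself. It assumes, for illustration, that prognostic strength of a biomarker X is scored by the mutual information I(X;Y) with the outcome Y, and that predictive strength is scored by the conditional mutual information I(T;Y|X) between treatment T and outcome given the biomarker. The function names, the plug-in estimator, and the toy trial data (with hypothetical biomarkers `x_prog` and `x_pred`) are all assumptions made for this example.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum over observed cells: p(x,y) * log(p(x,y) / (p(x) p(y))).
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def conditional_mutual_information(ts, ys, xs):
    """Plug-in estimate of I(T;Y|X) = sum_x p(x) * I(T;Y | X = x)."""
    n = len(xs)
    total = 0.0
    for x, nx in Counter(xs).items():
        idx = [i for i in range(n) if xs[i] == x]
        total += (nx / n) * mutual_information([ts[i] for i in idx],
                                               [ys[i] for i in idx])
    return total


# Toy balanced trial (hypothetical data): 10 patients per (T, x_prog, x_pred)
# cell. x_prog raises the response rate in both arms (prognostic), while
# x_pred raises it only under treatment (predictive).
T, x_prog, x_pred, Y = [], [], [], []
for t in (0, 1):
    for xp in (0, 1):
        for xd in (0, 1):
            responders = 2 + 4 * xp + 3 * t * xd  # out of 10 patients
            for k in range(10):
                T.append(t)
                x_prog.append(xp)
                x_pred.append(xd)
                Y.append(1 if k < responders else 0)

biomarkers = {"x_prog": x_prog, "x_pred": x_pred}
prognostic = {name: mutual_information(x, Y) for name, x in biomarkers.items()}
predictive = {name: conditional_mutual_information(T, Y, x)
              for name, x in biomarkers.items()}

print("prognostic ranking:", sorted(prognostic, key=prognostic.get, reverse=True))
print("predictive ranking:", sorted(predictive, key=predictive.get, reverse=True))
```

Keeping the two scores separate is what allows a strongly prognostic biomarker to rank high on one list without automatically ranking high on the other. The plug-in estimator used here is the simplest choice and can be biased on small samples.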
