Paper information - Distinguishing prognostic and predictive biomarkers: an information theoretic approach

Abstract

Motivation: The identification of biomarkers to support decision-making is central to personalized medicine, in both clinical and research scenarios. The challenge has two halves: identifying predictive markers, which guide the development and use of tailored therapies; and identifying prognostic markers, which guide other aspects of care and clinical trial planning, i.e. prognostic markers can be considered as covariates for stratification. Mistakenly assuming a biomarker to be predictive when it is in fact largely prognostic (or vice versa) is highly undesirable, and can have financial, ethical and personal consequences. We present a framework for data-driven ranking of biomarkers by their prognostic and predictive strength, using a novel information theoretic method. This approach provides a natural algebra to discuss and quantify the individual predictive and prognostic strength, in a self-consistent mathematical framework.

Results: Our contribution is a novel procedure, INFO+, which naturally distinguishes the prognostic versus predictive role of each biomarker and handles higher order interactions. In a comprehensive empirical evaluation, INFO+ outperforms more complex methods, most notably when noise factors dominate and biomarkers are likely to be falsely identified as predictive when they are in fact just strongly prognostic. Furthermore, we show that our methods can be 1–3 orders of magnitude faster than competitors, making them useful for biomarker discovery in ‘big data’ scenarios. Finally, we apply our methods to identify predictive biomarkers in two real clinical trials, and introduce a new graphical representation that provides greater insight into the prognostic and predictive strength of each biomarker.

Availability and implementation: R implementations of the suggested methods are available at https://github.com/sechidis.

Supplementary information: Supplementary data are available at Bioinformatics online.
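The distinction drawn in the abstract can be made concrete with a small sketch. This is not the INFO+ procedure and not the authors' R code. It is a minimal illustration that assumes two simple scores: the mutual information I(X;Y) between a biomarker X and the outcome Y as a prognostic score, and the conditional mutual information I(T;Y|X) between treatment T and outcome given the biomarker as a predictive score. The toy trial, the function names and the response rates are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written in terms of raw counts.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def conditional_mutual_information(ts, ys, xs):
    """Plug-in estimate of I(T;Y|X): I(T;Y) within each level of X, weighted by p(x)."""
    n = len(xs)
    total = 0.0
    for level, count in Counter(xs).items():
        idx = [i for i, x in enumerate(xs) if x == level]
        total += (count / n) * mutual_information(
            [ts[i] for i in idx], [ys[i] for i in idx]
        )
    return total


def build_toy_trial():
    """Hypothetical trial with 10 patients per (treatment, x1, x2) cell."""
    ts, x1s, x2s, ys = [], [], [], []
    for t in (0, 1):
        for x1 in (0, 1):
            for x2 in (0, 1):
                # x1 raises the response rate in both arms (prognostic);
                # x2 raises it only under treatment (predictive).
                responders = 2 + 4 * x1 + 3 * t * x2
                for k in range(10):
                    ts.append(t)
                    x1s.append(x1)
                    x2s.append(x2)
                    ys.append(1 if k < responders else 0)
    return ts, x1s, x2s, ys


ts, x1s, x2s, ys = build_toy_trial()
markers = {"x1": x1s, "x2": x2s}

# Assumed scores: marginal MI for prognostic strength, conditional MI for predictive strength.
prognostic = {name: mutual_information(xs, ys) for name, xs in markers.items()}
predictive = {
    name: conditional_mutual_information(ts, ys, xs) for name, xs in markers.items()
}

print("prognostic ranking:", sorted(prognostic, key=prognostic.get, reverse=True))
print("predictive ranking:", sorted(predictive, key=predictive.get, reverse=True))
```

On this toy data, x1 ranks first on the prognostic score and x2 ranks first on the predictive score. A marker that is strongly prognostic still receives a nonzero predictive score under these plug-in estimates, which is the kind of confusion the abstract warns about.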
