High-dimensional prediction of binary outcomes in the presence of between-study heterogeneity

Many prediction methods have been proposed in the literature, but most of them ignore heterogeneity between populations. Typically, either data from only a single study or population are available for model building and evaluation, or, when the training dataset comprises data from multiple studies, the studies are pooled before model building. As a result, prediction models may perform worse than expected when applied to new subjects from new study populations. We propose a linear method for building prediction models with high-dimensional data from multiple studies. Our method explicitly addresses between-population variability and tends to select predictors that are predictive in most of the study populations. We employ empirical Bayes estimators and hence avoid selection bias during the variable selection process. Simulation results demonstrate that the new method outperforms other linear prediction methods that ignore between-study variability. The method is developed for classification into two groups.
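To make the idea of favouring predictors that are predictive in most study populations concrete, the following is a minimal sketch and not the estimator proposed in this paper. It assumes a standard DerSimonian-Laird random-effects summary per predictor, computed from within-study standardized mean differences between the two classes. The function names `study_effects` and `random_effects_z` and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def study_effects(X, y):
    """Per-predictor standardized mean difference (class 1 minus class 0)
    within one study, with its usual large-sample variance approximation."""
    X1, X0 = X[y == 1], X[y == 0]
    n1, n0 = len(X1), len(X0)
    pooled_var = ((n1 - 1) * X1.var(axis=0, ddof=1)
                  + (n0 - 1) * X0.var(axis=0, ddof=1)) / (n1 + n0 - 2)
    d = (X1.mean(axis=0) - X0.mean(axis=0)) / np.sqrt(pooled_var)
    v = (n1 + n0) / (n1 * n0) + d ** 2 / (2 * (n1 + n0))
    return d, v

def random_effects_z(studies):
    """Combine per-study effects with a DerSimonian-Laird random-effects model.
    Returns a z-score and a between-study variance estimate per predictor."""
    # d and v have shape (number of studies, number of predictors)
    d, v = map(np.array, zip(*(study_effects(X, y) for X, y in studies)))
    n_studies = d.shape[0]
    w = 1.0 / v
    mu_fixed = (w * d).sum(axis=0) / w.sum(axis=0)
    q = (w * (d - mu_fixed) ** 2).sum(axis=0)            # Cochran's Q
    c = w.sum(axis=0) - (w ** 2).sum(axis=0) / w.sum(axis=0)
    tau2 = np.maximum(0.0, (q - (n_studies - 1)) / c)    # between-study variance
    w_re = 1.0 / (v + tau2)
    mu = (w_re * d).sum(axis=0) / w_re.sum(axis=0)
    se = np.sqrt(1.0 / w_re.sum(axis=0))
    return mu / se, tau2

# Simulated example (assumed data): predictor 0 has the same effect in all
# four studies, predictor 1 flips sign between studies, predictor 2 is null.
rng = np.random.default_rng(0)
effects = np.array([[1.0,  2.0, 0.0],
                    [1.0,  2.0, 0.0],
                    [1.0, -2.0, 0.0],
                    [1.0, -2.0, 0.0]])
studies = []
for shift in effects:
    y = np.repeat([0, 1], 50)
    X = rng.normal(size=(100, 3)) + np.outer(y, shift)
    studies.append((X, y))

z, tau2 = random_effects_z(studies)
```

In this sketch, a predictor whose effect changes sign between studies receives a large between-study variance estimate and therefore a small combined z-score, whereas pooling the studies first would hide that inconsistency. Ranking predictors by such a score is only one simple way to account for between-study variability.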
