Bayesian variable selection for high dimensional predictors and self-reported outcomes

Background: The onset of silent diseases such as type 2 diabetes is often recorded through self-report in large prospective cohorts. Self-reported outcomes are cost-effective; however, they are subject to error. Silent events may also be diagnosed using imperfect laboratory-based diagnostic tests. In this paper, we describe an approach for variable selection in high-dimensional datasets for settings in which the outcome is observed with error.

Methods: We adapt the spike and slab Bayesian variable selection approach to error-prone, self-reported outcomes. The performance of the proposed approach is studied through simulation studies. An illustrative application is included using data from the Women’s Health Initiative SNP Health Association Resource, which includes extensive genotypic (>900,000 SNPs) and phenotypic data on 9,873 African American and Hispanic American women.

Results: Simulation studies show improved sensitivity of our proposed method when compared to a naive approach that ignores error in the self-reported outcomes. Applied to a dataset of 9,873 African American and Hispanic participants in the Women’s Health Initiative, the proposed method identified several single nucleotide polymorphisms (SNPs) associated with risk of type 2 diabetes. There was little overlap between the racial groups among the top-ranking SNPs associated with type 2 diabetes risk, adding support to previous observations in the literature that disease-associated genetic loci are often not generalizable across racial/ethnic populations. The adapted Bayesian variable selection algorithm is implemented in R. The source code for the simulations is available in the Supplement.

Conclusions: Variable selection accuracy is reduced when the outcome is ascertained by error-prone self-reports. For this setting, our proposed algorithm has improved variable selection performance when compared to approaches that neglect to account for the error-prone nature of self-reports.
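
To see why ignoring outcome error can hurt variable selection, consider a deliberately simplified binary-outcome sketch. It is not the time-to-event model or the R implementation described above. It assumes a self-report with fixed sensitivity and specificity that do not depend on the predictor, and every number and function name in it is hypothetical, chosen only for illustration. Under those assumptions, the association seen in the self-reported outcome is weaker than the true association, which makes a truly associated predictor harder to detect.

```python
# Illustrative sketch only: a binary-outcome simplification, not the paper's model.
# All risks, sensitivity, and specificity values are made-up numbers.


def observed_positive_prob(p_true, sensitivity, specificity):
    """Probability of a positive self-report given the true event probability.

    A positive report arises either from a true event that is reported
    (sensitivity) or from a non-event that is falsely reported (1 - specificity).
    """
    return sensitivity * p_true + (1.0 - specificity) * (1.0 - p_true)


def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio comparing two event probabilities."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical true event risks for carriers and non-carriers of a variant.
p_unexposed, p_exposed = 0.10, 0.20
# Hypothetical accuracy of the self-report.
sensitivity, specificity = 0.60, 0.99

true_or = odds_ratio(p_exposed, p_unexposed)
naive_or = odds_ratio(
    observed_positive_prob(p_exposed, sensitivity, specificity),
    observed_positive_prob(p_unexposed, sensitivity, specificity),
)

print(f"odds ratio for the true outcome:          {true_or:.3f}")
print(f"odds ratio for the self-reported outcome: {naive_or:.3f}")
```

In this example the odds ratio computed from self-reports is closer to 1 than the true odds ratio. A likelihood that models the self-report through `observed_positive_prob`, rather than treating it as the true outcome, is one way to account for that attenuation.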
