Bayesian analysis of binary prediction tree models for retrospectively sampled outcomes.

Classification tree models are flexible analysis tools that can evaluate interactions among predictors as well as generate predictions for responses of interest. We describe Bayesian analysis of a specific class of tree models in which binary response data arise from a retrospective case-control design. We are particularly interested in problems with potentially very many candidate predictors, a scenario common in studies of gene expression data, which provide a key motivating context. Innovations here include the introduction of tree models that explicitly address and incorporate the retrospective design, and the use of nonparametric Bayesian models involving Dirichlet process priors on the distributions of predictor variables. The model specification influences the generation of trees through Bayes factor-based tests of association that determine significant binary partitions of nodes during forward generation of trees. We describe this constructive process and discuss the generation and combination of multiple trees via Bayesian model averaging for prediction. Parameter selection and sensitivity are discussed in the context of an example concerning prediction of breast tumour status using high-dimensional gene expression data; the example demonstrates the exploratory and explanatory uses of such models as well as their primary utility in prediction. Shortcomings of the approach and comparisons with alternative tree modelling algorithms are also discussed, as are issues of modelling and computational extensions.
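As a rough illustration of a Bayes factor test of association for a candidate binary split under a retrospective design, the sketch below models the probability that a predictor exceeds a threshold separately in controls and cases. It is a simplified stand-in, not the model of the paper: it swaps the Dirichlet process priors for independent Beta(a, b) priors, and the function names `log_bayes_factor` and `best_split`, the default a = b = 1, and the threshold scan are all assumptions made for illustration.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(n0, k0, n1, k1, a=1.0, b=1.0):
    """Log Bayes factor for association between a thresholded predictor and outcome.

    n0, n1: numbers of controls and cases in the node.
    k0, k1: how many of each group have predictor value above the threshold.
    H1: exceedance probabilities differ between groups (independent Beta(a, b) priors).
    H0: a single common exceedance probability (Beta(a, b) prior).
    Binomial coefficients cancel between the two marginal likelihoods.
    """
    log_m1 = (log_beta(a + k0, b + n0 - k0) - log_beta(a, b)
              + log_beta(a + k1, b + n1 - k1) - log_beta(a, b))
    log_m0 = log_beta(a + k0 + k1, b + n0 + n1 - k0 - k1) - log_beta(a, b)
    return log_m1 - log_m0


def best_split(x, z):
    """Scan observed values of one predictor as thresholds (an assumed search rule).

    x: predictor values; z: 0/1 outcomes (0 = control, 1 = case).
    Returns (threshold, log Bayes factor) for the strongest split, or None.
    """
    controls = [xi for xi, zi in zip(x, z) if zi == 0]
    cases = [xi for xi, zi in zip(x, z) if zi == 1]
    best = None
    for tau in sorted(set(x))[:-1]:  # the largest value would leave one side empty
        k0 = sum(xi > tau for xi in controls)
        k1 = sum(xi > tau for xi in cases)
        lbf = log_bayes_factor(len(controls), k0, len(cases), k1)
        if best is None or lbf > best[1]:
            best = (tau, lbf)
    return best
```

In a forward tree-building pass of this kind, a node would be split only if the largest log Bayes factor over predictors and thresholds exceeded some chosen cutoff; the cutoff is a tuning choice that this sketch leaves to the caller.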
