Bayesian Regression Analysis in the "Large p, Small n" Paradigm with Application in DNA Microarray S

Statistical modelling and inference problems in which sample sizes are substantially smaller than the number of available and potentially interesting predictors (explanatory variables) abound in applied science and medicine. These “Large p, Small n” problems pose challenges to standard statistical methods and demand new concepts and models for regression and classification. Our motivating applied context is functional genomics; more specifically, studies of clinical or physiological phenotypes in which the predictors are measured expression levels of large numbers of genes obtained from high-density DNA microarrays. In a canonical framework of binary regression, we discuss (a) issues of regression modelling utilising singular-value decompositions of design matrices that are massively rank deficient, (b) the imperatives for careful, informative prior specifications on high-dimensional regression parameters, (c) the development of new classes of structured prior distributions for this problem, and (d) the development of appropriate computational methods and modes of posterior inference for regression estimation and for predictive inference in out-of-sample classification. Out-of-sample prediction is fundamental to genomic phenotyping applications. We study and exemplify the new statistical methodology in a problem of breast cancer phenotyping using DNA microarray expression profiles as predictors, and in discrimination of leukemia types.
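To make point (a) concrete, the following is a minimal sketch of how a singular-value decomposition reduces a rank-deficient design matrix to at most n factor predictors. It is an illustration under stated assumptions, not the paper's implementation: the data are synthetic, the design matrix is taken to have samples in rows and genes in columns, and the names `F`, `gamma` and `beta` are hypothetical labels introduced here.

```python
import numpy as np

# Synthetic stand-in for expression data (assumption: n samples in rows,
# p genes in columns, with p much larger than n).
rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.standard_normal((n, p))

# Thin SVD: X = U @ diag(d) @ Vt, with U (n x n), d (n,), Vt (n x p).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only directions with non-negligible singular values.
k = int(np.sum(d > 1e-10 * d[0]))

# Factor predictors: an n x k matrix that replaces the n x p design.
F = U[:, :k] * d[:k]

# Any coefficient vector gamma on the k factors (hypothetical values here)
# maps back to a p-dimensional coefficient vector on the genes.
gamma = rng.standard_normal(k)
beta = Vt[:k].T @ gamma

# The linear predictor is identical in both parameterisations.
assert np.allclose(X @ beta, F @ gamma)
```

In this sketch a binary regression would be fitted on the k columns of `F` rather than the p columns of `X`; the prior placed on `gamma` then induces a prior on `beta`, which is why informative, structured priors matter in this setting.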