Selection properties of type II maximum likelihood (empirical Bayes) in linear models with individual variance components for predictors

Maximum likelihood (ML) in the linear model overfits when the number of predictors (M) exceeds the number of objects (N). One possible solution is the relevance vector machine (RVM), a form of automatic relevance determination that has gained popularity in the pattern recognition and machine learning community through the well-known textbook of Bishop (2006). RVM assigns an individual precision to the weight of each predictor; these precisions are then estimated by maximizing the marginal likelihood (type II ML or empirical Bayes). We investigated the selection properties of RVM both analytically and by experiments in a regression setting. We show analytically that, for orthogonal predictors, RVM selects a predictor when its absolute z-ratio (|least squares estimate|/standard error) exceeds 1 and, for M = 2, that this still holds for correlated predictors when the other z-ratio is large. RVM selects the stronger of two highly correlated predictors. In experiments with real and simulated data, RVM is outcompeted by other popular regularization methods (LASSO and/or PLS) in terms of prediction performance. We conclude that type II ML is not the general answer to high-dimensional prediction problems. In extensions of RVM aimed at stronger selection, improper priors (based on the inverse gamma family) have been assigned to the inverse precisions (variances), with their parameters estimated by penalized marginal likelihood. We critically assess this approach and suggest a proper variance prior related to the Beta distribution that gives similar selection and shrinkage properties and allows a fully Bayesian treatment.
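A minimal sketch of the orthogonal-predictor threshold may help fix ideas. The notation below (sparsity factor s_j and quality factor q_j) follows the per-predictor analysis of the marginal likelihood in sparse Bayesian learning [8, 24] and assumes a known noise variance \(\sigma^2\); write \(x_j\) for the j-th predictor column, \(y\) for the response and \(\alpha_j\) for the precision of weight j. Holding the other precisions fixed, the log marginal likelihood depends on \(\alpha_j\) only through

\[
\ell(\alpha_j) = \tfrac{1}{2}\Big[\log \alpha_j - \log(\alpha_j + s_j) + \frac{q_j^2}{\alpha_j + s_j}\Big] + \text{const},
\qquad
s_j = \frac{x_j^{\top} x_j}{\sigma^2},
\quad
q_j = \frac{x_j^{\top} y}{\sigma^2},
\]

where the simple forms of \(s_j\) and \(q_j\) use the orthogonality of \(x_j\) to the other predictor columns. This function has a finite maximizer \(\hat{\alpha}_j = s_j^2/(q_j^2 - s_j)\) exactly when \(q_j^2 > s_j\); otherwise it increases as \(\alpha_j \to \infty\) and predictor j is pruned. With \(\hat{\beta}_j = x_j^{\top} y / (x_j^{\top} x_j)\) and \(\mathrm{SE}(\hat{\beta}_j) = \sigma/\sqrt{x_j^{\top} x_j}\), the retention condition becomes

\[
q_j^2 > s_j
\;\Longleftrightarrow\;
\frac{(x_j^{\top} y)^2}{\sigma^2 \, x_j^{\top} x_j} > 1
\;\Longleftrightarrow\;
z_j^2 = \left(\frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}\right)^2 > 1,
\]

i.e. a predictor is retained precisely when its absolute z-ratio exceeds 1.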

[1] R Core Team. R: A language and environment for statistical computing, 2014.

[2] H. Zou, et al. Regularization and variable selection via the elastic net, 2005.

[3] A. Karatzoglou, et al. kernlab - An S4 Package for Kernel Methods in R, 2004.

[4] R Development Core Team. R: A language and environment for statistical computing, 2010.

[5] D. L. Donoho. De-noising by soft-thresholding, IEEE Trans. Inf. Theory, 1995.

[6] I. E. Frank and J. H. Friedman. A Statistical View of Some Chemometrics Regression Tools, 1993.

[7] M. Kendall, et al. Kendall's advanced theory of statistics, 1995.

[8] A. C. Faul and M. E. Tipping. Analysis of Sparse Bayesian Learning, NIPS, 2001.

[9] I. E. Frank and J. H. Friedman. A Statistical View of Some Chemometrics Regression Tools: Response, 1993.

[10] G. H. Golub and C. F. Van Loan. Matrix Computations, 1983.

[11] K. Thompson, et al. Matrix identities, 1990.

[12] S. Wold, et al. PLS-regression: a basic tool of chemometrics, 2001.

[13] D. J. C. MacKay. The Evidence Framework Applied to Classification Networks, Neural Computation, 1992.

[14] T. H. E. Meuwissen, B. J. Hayes and M. E. Goddard. Prediction of total genetic value using genome-wide dense marker maps, Genetics, 2001.

[15] M. A. T. Figueiredo. Adaptive Sparseness for Supervised Learning, IEEE Trans. Pattern Anal. Mach. Intell., 2003.

[16] E. I. George, et al. Journal of the American Statistical Association, 2007.

[17] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[18] S. Xu. An expectation–maximization algorithm for the Lasso estimation of quantitative trait locus effects, Heredity, 2010.

[19] S. Xu. An Empirical Bayes Method for Estimating Epistatic Effects of Quantitative Trait Loci, Biometrics, 2007.

[20] I. M. Johnstone, et al. Adapting to unknown sparsity by controlling the false discovery rate, 2005, math/0505374.

[21] M. E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research, 2001.

[22] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems, 2000.

[23] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Least angle regression, 2004, math/0406456.

[24] M. E. Tipping and A. C. Faul. Fast Marginal Likelihood Maximisation for Sparse Bayesian Models, 2003.

[25] S. Rogers and M. Girolami. A Bayesian regression approach to the inference of regulatory networks from gene expression data, Bioinformatics, 2005.

[26] J. O. Berger. Statistical Decision Theory and Bayesian Analysis, 1988.

[27] Y. Li, et al. Bayesian automatic relevance determination algorithms for classifying gene expression data, Bioinformatics, 2002.

[28] C. M. Bishop. Pattern Recognition and Machine Learning, Springer, 2006.

[29] C. J. F. ter Braak. Bayesian sigmoid shrinkage with improper variance priors and an application to wavelet denoising, Comput. Stat. Data Anal., 2006.

[30] J. Friedman, T. Hastie and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, 2010.

[31] J. G. Scott, et al. Alternative Global-Local Shrinkage Priors Using Hypergeometric-Beta Mixtures, 2009.

[32] R. M. Neal. Bayesian Learning for Neural Networks, 1995.

[33] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage, Biometrika, 1994.

[34] C. J. F. ter Braak, et al. Extending Xu's Bayesian Model for Estimating Polygenic Effects Using Markers of the Entire Genome, Genetics, 2005.

[35] C. J. F. ter Braak. Regression by L1 regularization of smart contrasts and sums (ROSCAS) beats PLS and elastic net in latent variable model, 2009.

[36] B.-H. Mevik and R. Wehrens. The pls Package: Principal Component and Partial Least Squares Regression in R, 2007.

[38] I. M. Johnstone and B. W. Silverman. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences, 2004, math/0410088.

[39] J. Fan and R. Li. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, 2001.