Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

We consider the problem of classification in a high-dimensional feature space. Bickel and Levina (2004) recommend the use of naive Bayes classifiers, that is, treating the features as if they were statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008) suggested a combined variable selection and classification method called FAIR. The FAIR method improves on naive Bayes classifiers in sparse setups; the improvement comes from reduced noise in estimating the features' means, since only the means of a few selected variables need to be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain nonparametric empirical Bayes procedure. In sparse setups the empirical Bayes procedure implicitly performs an efficient variable selection. It also adapts very well to non-sparse setups, and has the advantage of exploiting the information in many "weakly informative" variables, which variable-selection-based classification procedures give up on. We compare our method with FAIR and other classification methods in simulations, for sparse and non-sparse setups, and on real data examples involving classification of normal versus malignant tissues based on microarray data.
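The abstract does not spell out the paper's exact empirical Bayes procedure. As an illustrative sketch only, the following implements one standard nonparametric empirical Bayes estimator of a vector of normal means, via Tweedie's formula with a kernel-estimated marginal density; the bandwidth, the simulated sparse setup, and the function name are hypothetical choices, not taken from the paper.

```python
import numpy as np

def tweedie_eb_means(z, bandwidth=0.5):
    """Nonparametric empirical Bayes estimate of normal means via
    Tweedie's formula: E[mu | z] = z + sigma^2 * f'(z)/f(z), with
    sigma^2 = 1 and the marginal density f estimated by a Gaussian
    kernel density estimate on the observations themselves.
    This is an illustrative sketch, not the paper's exact procedure."""
    z = np.asarray(z, dtype=float)
    # diff[i, j] = z_i - z_j : pairwise differences for the KDE.
    diff = z[:, None] - z[None, :]
    k = np.exp(-0.5 * (diff / bandwidth) ** 2)        # Gaussian kernel weights
    f = k.sum(axis=1)                                  # unnormalized marginal density
    fprime = (-(diff / bandwidth**2) * k).sum(axis=1)  # its derivative in z
    # The normalizing constant cancels in the ratio f'/f.
    return z + fprime / f

# Hypothetical sparse setup: most means are zero, a few are large.
rng = np.random.default_rng(0)
mu = np.concatenate([np.zeros(95), np.full(5, 4.0)])
z = mu + rng.standard_normal(mu.size)
mu_hat = tweedie_eb_means(z)
# Noise coordinates are shrunk toward zero (implicit variable
# selection), while coordinates near the signal cluster keep most
# of their estimated mean.
```

In a sparse setup such as the one simulated above, the correction term pulls each observation toward nearby density mass, so the many null coordinates are shrunk strongly toward zero while isolated large means are left largely intact, which mirrors the implicit variable selection described in the abstract.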

[1] E. Lander, et al. Gene expression correlates of clinical prostate cancer behavior, 2002, Cancer Cell.

[2] E. Lehmann, et al. Testing statistical hypotheses, 1960.

[3] Wenhua Jiang, et al. General maximum likelihood empirical Bayes estimation of normal means, 2009, arXiv:0908.1709.

[4] Jianqing Fan, et al. High dimensional classification using features annealed independence rules, 2007, Annals of Statistics.

[5] Ya'acov Ritov, et al. Asymptotic efficiency of simple decisions for the compound decision problem, 2008, arXiv:0802.1319.

[6] B. Efron. Empirical Bayes estimates for large-scale prediction problems, 2009, Journal of the American Statistical Association.

[7] J. B. Copas, et al. Compound decisions and empirical Bayes, 1969.

[8] L. Brown. In-season prediction of batting averages: A field test of empirical Bayes and Bayes methodologies, 2008, arXiv:0803.3697.

[9] Lawrence D. Brown, et al. Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means, 2009, arXiv:0908.1712.

[10] Cun-Hui Zhang, et al. Compound decision theory and empirical Bayes methods, 2003.

[11] P. Bickel, et al. Some theory for Fisher's linear discriminant function, 2004.

[12] L. Brown. Admissible estimators, recurrent diffusions, and insoluble boundary value problems, 1971.

[13] Guy Lebanon, et al. Regularization through variable selection and conditional MLE with application to classification in high dimensions, 2009.

[14] S. Ramaswamy, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma, 2002, Cancer Research.

[15] R. Tibshirani, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression, 2002, Proceedings of the National Academy of Sciences of the United States of America.

[16] J. Mesirov, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, 1999, Science.