Feature selection in omics prediction problems using cat scores and false nondiscovery rate control

We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted $t$-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James-Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package "sda", available from the R repository CRAN.
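
To make the cat-score idea concrete, the following is a minimal sketch in Python for the special case of two features, where the inverse square root of the correlation matrix $P$ has a closed form. It assumes that the cat score for class $k$ is the $t$-score of the class centroid against the pooled centroid, premultiplied by $P^{-1/2}$. The scale factor $1/n_k - 1/n$, the toy numbers, and all function names are illustrative assumptions; this is not the interface of the "sda" package, and it uses plain (unshrunken) inputs rather than James-Stein estimates.

```python
import math


def t_scores(mu_k, mu_pool, var, n_k, n):
    """Per-feature t-scores of the class-k centroid against the pooled centroid.

    Assumption: the squared standard error of (mu_k - mu_pool) is
    (1/n_k - 1/n) * var for each feature.
    """
    scale = 1.0 / n_k - 1.0 / n
    return [(m - p) / math.sqrt(scale * v)
            for m, p, v in zip(mu_k, mu_pool, var)]


def inv_sqrt_corr_2x2(r):
    """Closed-form P^(-1/2) for the correlation matrix P = [[1, r], [r, 1]].

    P has eigenvalues 1 + r and 1 - r with eigenvectors (1, 1) and (1, -1),
    so its inverse square root is again of the form [[a, b], [b, a]].
    Requires -1 < r < 1.
    """
    hi = 1.0 / math.sqrt(1.0 + r)
    lo = 1.0 / math.sqrt(1.0 - r)
    a = 0.5 * (hi + lo)
    b = 0.5 * (hi - lo)
    return [[a, b], [b, a]]


def cat_scores_2d(t, r):
    """Correlation-adjusted t-scores: decorrelate the t-score vector with P^(-1/2)."""
    m = inv_sqrt_corr_2x2(r)
    return [m[0][0] * t[0] + m[0][1] * t[1],
            m[1][0] * t[0] + m[1][1] * t[1]]


if __name__ == "__main__":
    # Hypothetical toy numbers: feature 1 differs between centroids, feature 2 does not.
    t = t_scores(mu_k=[2.0, 1.0], mu_pool=[1.0, 1.0], var=[1.0, 4.0], n_k=10, n=40)
    for r in (0.0, 0.6):
        print("r =", r, "t =", t, "cat =", cat_scores_2d(t, r))
```

With zero correlation the cat scores coincide with the ordinary $t$-scores. With nonzero correlation, a feature whose own $t$-score is zero can still receive a nonzero cat score, which is the sense in which the scores account for dependence between features. The sum of squared cat scores equals the quadratic form $t^\top P^{-1} t$.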
