Simultaneous variable selection and class fusion with penalized distance criterion based classifiers

Two new methods are proposed for the high-dimensional setting that simultaneously construct multiclass classifiers, select the variables important for classification, and determine the discriminative variables for each pair of classes. Unlike existing methods, which estimate the precision matrix and the mean vectors separately, the proposed methods construct classifiers by directly estimating either the products of the precision matrix and the mean vectors or all discriminant directions, with appropriate penalties. This leads to a distance criterion in place of the log-likelihood used in the existing literature. The proposed methods consistently select the variables important for classification and consistently determine the discriminative variables for each pair of classes. For the multiclass classification problem, the conditional misclassification error rates of the resulting classifiers converge in probability to the misclassification error rate of the Bayes rule, and rates of convergence are obtained. Simulations and a real data analysis show that the proposed methods perform well in comparison with existing methods.
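
The idea of estimating a discriminant direction directly, rather than plugging in separate estimates of the precision matrix and the mean vectors, can be illustrated with a minimal sketch. This is not the estimator proposed in the paper: the L1-penalized quadratic objective, the coordinate-descent solver, the small ridge term, and the names `direct_direction` and `pairwise_directions` are all assumptions made for illustration. Under these assumptions, a variable would be read as discriminative for a pair of classes when its coefficient in that pair's direction is nonzero.

```python
import numpy as np
from itertools import combinations


def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def direct_direction(sigma, delta, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for
        min_b 0.5 * b' sigma b - delta' b + lam * ||b||_1,
    a penalized surrogate for b = sigma^{-1} delta (illustrative assumption).
    """
    p = len(delta)
    b = np.zeros(p)
    for _ in range(n_iter):
        b_old = b.copy()
        for j in range(p):
            # delta_j minus the contribution of all other coordinates
            r = delta[j] - sigma[j] @ b + sigma[j, j] * b[j]
            b[j] = soft_threshold(r, lam) / sigma[j, j]
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b


def pairwise_directions(X, y, lam, ridge=1e-6):
    """Return {(k, l): sparse direction} for every pair of classes."""
    classes = np.unique(y)
    n, p = X.shape
    means = {k: X[y == k].mean(axis=0) for k in classes}
    # Pooled within-class covariance; the ridge keeps the diagonal positive.
    centered = np.vstack([X[y == k] - means[k] for k in classes])
    sigma = centered.T @ centered / (n - len(classes)) + ridge * np.eye(p)
    return {
        (k, l): direct_direction(sigma, means[k] - means[l], lam)
        for k, l in combinations(classes, 2)
    }
```

For example, calling `pairwise_directions(X, y, lam=0.1)` on a data matrix `X` with integer labels `y` returns one coefficient vector per class pair; the value of `lam` here is arbitrary and would normally be chosen by cross-validation.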
