SPARSE QUADRATIC DISCRIMINANT ANALYSIS FOR HIGH DIMENSIONAL DATA

Many contemporary studies involve classifying a subject into one of two classes based on n observations of the p variables associated with the subject. Under the assumption that the variables are normally distributed, the well-known linear discriminant analysis (LDA) assumes a common covariance matrix across the two classes, while quadratic discriminant analysis (QDA) allows different covariance matrices. When p is much smaller than n, even if both diverge, the LDA and QDA achieve the smallest asymptotic misclassification rates in the cases of equal and unequal covariance matrices, respectively. Modern statistical studies, however, often face classification problems in which the number of variables p is much larger than the sample size n, and the classical LDA and QDA can then perform poorly. In fact, we give an example in which the QDA performs as poorly as random guessing even when the true covariances are known. Under some sparsity conditions on the unknown means and covariance matrices of the two classes, we propose a sparse QDA based on thresholding that achieves the smallest asymptotic misclassification rate conditional on the training data. We discuss an example of classifying normal and tumor colon tissues based on a set of p = 2,000 genes and a sample of size n = 62, and another example from a cardiovascular study with n = 222 subjects and p = 2,434 genes. A simulation is also conducted to assess the performance of the proposed method.
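To make the idea concrete, the following is a minimal sketch (not the authors' actual estimator) of a thresholding-based sparse QDA: sample covariance matrices are hard-thresholded off the diagonal before being plugged into the Gaussian quadratic discriminant rule. The function names, the fixed threshold `t_cov`, and the equal-prior classification rule are illustrative assumptions; the paper's method also thresholds the estimated mean differences and calibrates the thresholds to the theory.

```python
import numpy as np

def hard_threshold(M, t):
    """Zero out off-diagonal entries with magnitude <= t (sparsity assumption)."""
    out = np.where(np.abs(M) > t, M, 0.0)
    np.fill_diagonal(out, np.diag(M))  # keep diagonal (variances) intact
    return out

def fit_sparse_qda(X1, X2, t_cov=0.1):
    """Estimate class means and thresholded covariances from training samples."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = hard_threshold(np.cov(X1, rowvar=False), t_cov)
    S2 = hard_threshold(np.cov(X2, rowvar=False), t_cov)
    return mu1, S1, mu2, S2

def qda_score(x, mu, S):
    """Gaussian log-density up to a constant: -0.5*log|S| - 0.5*(x-mu)' S^{-1} (x-mu)."""
    _, logdet = np.linalg.slogdet(S)
    d = x - mu
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(S, d)

def classify(x, params):
    """Assign x to the class with the larger quadratic discriminant score (equal priors)."""
    mu1, S1, mu2, S2 = params
    return 1 if qda_score(x, mu1, S1) >= qda_score(x, mu2, S2) else 2
```

Note that hard thresholding can break positive definiteness when the threshold is aggressive relative to the sample size; practical implementations typically guard against this, e.g. by restricting the threshold range or adding a small ridge to the diagonal.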
