Block-diagonal discriminant analysis and its bias-corrected rules

Abstract High-throughput expression profiling allows the simultaneous measurement of tens of thousands of genes. These data have motivated the development of reliable biomarkers for disease subtype identification and diagnosis. Many methods have been developed for analyzing these data, including diagonal discriminant analysis, support vector machines, and k-nearest neighbor methods. Diagonal discriminant methods have been shown to perform well for high-dimensional data with small sample sizes. Despite their popularity, their independence assumption is unlikely to hold in practice. Recently, a gene module-based linear discriminant analysis strategy has been proposed that exploits the correlation among genes. However, this approach can be underpowered when the sample sizes of the two classes are unbalanced. In this paper, we propose to correct the biases in the discriminant scores of block-diagonal discriminant analysis. In simulation studies, our proposed method outperforms other approaches in various settings. We also illustrate the proposed discriminant analysis method on data from microarray studies.
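To make the idea of block-diagonal discriminant analysis concrete, the following is a minimal sketch of a plain plug-in rule, not the bias-corrected rule proposed in the paper. It assumes the gene blocks (modules) are already given as lists of feature indices, uses a pooled within-class covariance estimated separately for each block, and adds a small ridge term for numerical stability. The function names `fit_block_lda` and `predict_block_lda`, the `ridge` parameter, and the exact form of the score are illustrative assumptions.

```python
import numpy as np

def fit_block_lda(X, y, blocks, ridge=1e-6):
    """Fit a plug-in block-diagonal LDA (illustrative sketch only).

    X: (n, p) expression matrix; y: (n,) class labels;
    blocks: list of index lists partitioning the p features (assumed given).
    """
    classes = np.unique(y)
    n = X.shape[0]
    means = {k: X[y == k].mean(axis=0) for k in classes}
    log_priors = {k: np.log(np.mean(y == k)) for k in classes}
    # Residuals around each class mean, used for the pooled covariance.
    resid = np.vstack([X[y == k] - means[k] for k in classes])
    precisions = []
    for idx in blocks:
        R = resid[:, idx]
        S = R.T @ R / (n - len(classes))      # pooled covariance of this block
        S = S + ridge * np.eye(len(idx))      # assumed ridge for invertibility
        precisions.append(np.linalg.inv(S))
    return {"classes": classes, "means": means, "log_priors": log_priors,
            "blocks": blocks, "precisions": precisions}

def predict_block_lda(model, x):
    """Assign x to the class with the smallest discriminant score."""
    scores = []
    for k in model["classes"]:
        diff = x - model["means"][k]
        d = 0.0
        # The score is a sum of per-block Mahalanobis distances, because
        # correlations between different blocks are taken to be zero.
        for idx, P in zip(model["blocks"], model["precisions"]):
            d += diff[idx] @ P @ diff[idx]
        scores.append(d - 2.0 * model["log_priors"][k])
    return model["classes"][int(np.argmin(scores))]
```

Setting every block to a single gene reduces this sketch to a diagonal discriminant rule, while a single block containing all genes gives ordinary linear discriminant analysis. With unbalanced classes, the plug-in scores of the smaller class are estimated less precisely, which is the kind of bias the abstract refers to.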
