Improved centroids estimation for the nearest shrunken centroid classifier

MOTIVATION: The nearest shrunken centroid (NSC) method has been successfully applied to many DNA-microarray classification problems. The NSC uses 'shrunken' centroids as prototypes for each class and identifies the subsets of genes that best characterize each class. A sample is then assigned to the class with the nearest shrunken centroid. The NSC is easy to implement and easy to interpret; however, it has drawbacks.

RESULTS: We show that the NSC method can be interpreted in the framework of LASSO regression. Building on this interpretation, we consider two new methods for microarray classification that improve on the NSC: adaptive L∞-norm penalized NSC (ALP-NSC) and adaptive hierarchically penalized NSC (AHP-NSC), each with a different penalty function. Unlike the L1-norm penalty used in LASSO, the penalty terms that we consider exploit the fact that the parameters belonging to one gene form a natural group and should be treated as such. Numerical results indicate that the two new methods tend to remove irrelevant genes more effectively and to give better classification results than the L1-norm approach.

AVAILABILITY: R code for the ALP-NSC and AHP-NSC algorithms is available from the authors upon request.
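To make the baseline procedure concrete, the following is a minimal Python sketch of the standard NSC classifier described above: soft-thresholded class centroids followed by assignment to the nearest one. It is not the authors' R code and does not implement ALP-NSC or AHP-NSC. The function names (`soft_threshold`, `fit_nsc`, `predict`, `selected_genes`), the threshold parameter `delta`, the use of the median standard deviation as the offset `s0`, and the toy data are all assumptions made for illustration.

```python
import math
from statistics import median


def soft_threshold(d, delta):
    """Shrink d toward zero by delta; anything within delta of zero becomes 0."""
    return math.copysign(max(abs(d) - delta, 0.0), d)


def fit_nsc(X, y, delta):
    """Fit shrunken centroids. X: list of samples (lists of gene values), y: labels."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    members = {k: [row for row, lab in zip(X, y) if lab == k] for k in classes}
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    centroids = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
                 for k, rows in members.items()}
    # Pooled within-class standard deviation of each gene.
    s = []
    for j in range(p):
        ss = sum((row[j] - centroids[lab][j]) ** 2 for row, lab in zip(X, y))
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)  # assumed offset that guards against very small s[j]
    shrunken = {}
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) - 1.0 / n)
        shrunken[k] = []
        for j in range(p):
            scale = m_k * (s[j] + s0)
            d = (centroids[k][j] - overall[j]) / scale  # standardized offset
            shrunken[k].append(overall[j] + scale * soft_threshold(d, delta))
    return {
        "shrunken": shrunken,
        "overall": overall,
        "scale": [sj + s0 for sj in s],
        "priors": {k: len(members[k]) / n for k in classes},
    }


def predict(model, x):
    """Assign x to the class with the smallest standardized distance score."""
    def score(k):
        dist = sum(((xj - cj) / sc) ** 2
                   for xj, cj, sc in zip(x, model["shrunken"][k], model["scale"]))
        return dist - 2.0 * math.log(model["priors"][k])
    return min(model["shrunken"], key=score)


def selected_genes(model):
    """Genes whose shrunken centroid still differs from the overall centroid."""
    p = len(model["overall"])
    return [j for j in range(p)
            if any(c[j] != model["overall"][j] for c in model["shrunken"].values())]


# Toy data: gene 0 separates the classes, gene 1 is noise.
X = [[5.0, 1.0], [6.0, 0.9], [5.5, 1.1], [1.0, 1.0], [0.5, 1.1], [1.5, 0.9]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_nsc(X, y, delta=1.0)
print(selected_genes(model), predict(model, [5.2, 1.0]), predict(model, [0.8, 1.05]))
```

In this sketch each class-by-gene offset is thresholded on its own, which corresponds to the L1-norm behaviour the abstract contrasts with the grouped penalties: a gene drops out only when all of its class offsets happen to be shrunk to zero individually.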
