Mining Potential Information for Multiclass Microarray Data Using Centroid-Based Dimension Reduction

In this paper, we propose a novel dimension reduction algorithm that fuses Centroid-based feature selection with partial least squares (PLS) based feature extraction. The paper focuses on mining the potential information hidden in multiclass microarray data and on interpreting the results that this information provides. First, a centroid concept is introduced to define the objective function of feature selection. To obtain a sparse solution, logistic regression with L1 regularization is incorporated into the objective function, and Centroid-based feature selection is proposed to solve the resulting optimization problem. Using the One-Versus-All (OVA) technique, Centroid-based feature selection is extended to multiclass problems. Second, we perform feature importance analysis on microarray data with Centroid-based feature selection to determine the informative feature subset (biomarkers). Finally, PLS-based feature extraction is conducted on the selected feature subset to extract the features that best reflect the nature of the classification. The proposed algorithm is compared with three state-of-the-art algorithms on eight multiclass microarray datasets. The experimental results demonstrate that the proposed algorithm performs effectively and is competitive. Furthermore, mining the potential information of the microarray datasets improves the interpretability of the results.
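The two-stage structure described above (sparse One-Versus-All selection followed by PLS-based extraction) can be illustrated with a minimal sketch. It rests on several assumptions: the abstract does not give the centroid-based objective, so plain L1-regularized logistic regression fitted by proximal gradient stands in for it; the PLS step uses a simple SVD-based variant with deflation on a one-hot class matrix; and the function names (`l1_logistic`, `ova_select`, `pls_scores`), the penalty `lam`, and the synthetic data are all hypothetical, not taken from the paper.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of the L1 norm: shrinks toward zero, giving exact zeros.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l1_logistic(X, y, lam=0.1, n_iter=500):
    # Proximal-gradient (ISTA) fit of L1-penalised logistic regression, y in {0, 1}.
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    # Step size 1/L, with L an upper bound on the gradient's Lipschitz constant.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30.0, 30.0)
        resid = 1.0 / (1.0 + np.exp(-z)) - y
        w = soft_threshold(w - step * (X.T @ resid) / n, step * lam)
        b -= step * resid.mean()
    return w, b

def ova_select(X, labels, lam=0.1):
    # One-Versus-All: one sparse binary model per class; keep every feature
    # that receives a non-zero weight in at least one of them.
    classes = np.unique(labels)
    W = np.column_stack(
        [l1_logistic(X, (labels == c).astype(float), lam)[0] for c in classes]
    )
    return np.flatnonzero(np.any(W != 0, axis=1)), W

def pls_scores(X, Y, n_components=2):
    # PLS score vectors for the training data: each weight vector is the leading
    # left singular vector of X^T Y, and X is deflated after every component.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    T = []
    for _ in range(n_components):
        u, _, _ = np.linalg.svd(X.T @ Y, full_matrices=False)
        t = X @ u[:, 0]
        X = X - np.outer(t, X.T @ t / (t @ t))
        T.append(t)
    return np.column_stack(T)

# Synthetic stand-in for expression data: 3 classes, 40 "genes", 3 informative.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 40))
for c in range(3):
    X[labels == c, c] += 3.0
X = (X - X.mean(axis=0)) / X.std(axis=0)

selected, W = ova_select(X, labels)
Y = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
scores = pls_scores(X[:, selected], Y, n_components=2)
print(selected, scores.shape)
```

In this sketch the columns of `W` show which selected features matter for which class, and `scores` is the low-dimensional representation that would be passed to a classifier. Projecting unseen samples would additionally require storing the weight and loading vectors, which is omitted here.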
