Multi-step dimensionality reduction and semi-supervised graph-based tumor classification using gene expression data

OBJECTIVE Both supervised and unsupervised methods have been widely used for tumor classification based on gene expression profiles. This paper introduces a semi-supervised graph-based method for the same task. Feature extraction plays a key role in tumor classification from gene expression profiles and can greatly improve classifier performance, so we also propose a novel multi-step dimensionality reduction method for extracting tumor-related features.

METHODS AND MATERIALS First, the Wilcoxon rank-sum test is used for gene selection. Then gene ranking and the discrete cosine transform are combined with principal component analysis for feature extraction. Finally, the extracted features are evaluated with semi-supervised learning algorithms.

RESULTS To assess the validity of the proposed method, we apply it to four tumor datasets comprising various human normal and tumor tissue samples. The experimental results show that the proposed method is efficient and feasible, and that it achieves higher prediction accuracy than the other methods compared. In particular, the semi-supervised method outperforms support vector machines in classification performance.

CONCLUSIONS The proposed approach can effectively improve the performance of tumor classification based on gene expression profiles. This work is an attempt to explore and apply multi-step dimensionality reduction and semi-supervised learning in the field of tumor classification. Given the high classification accuracy obtained, these methods appear to have considerable potential for further application to tumor classification.
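The pipeline outlined in the abstract (rank-sum gene selection, gene ranking with a discrete cosine transform, principal component analysis, then a graph-based semi-supervised classifier) could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how the steps are combined, so the ordering of genes by absolute rank-sum score before the DCT, the number of genes, coefficients and components kept, and the choice of the local and global consistency method as the graph-based classifier are all assumptions. The function names (`rank_sum_z`, `dct_matrix`, `reduce_dimensions`, `lgc_predict`) are hypothetical, and the data are synthetic.

```python
import numpy as np

def rank_sum_z(X, y):
    """Wilcoxon rank-sum z-score per gene (normal approximation, no tie correction)."""
    n1, n0 = int(np.sum(y == 1)), int(np.sum(y == 0))
    # 1-based rank of every sample within each gene column (assumes no ties).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1.0
    w = ranks[y == 1].sum(axis=0)                      # rank sum of class 1
    mean = n1 * (n1 + n0 + 1) / 2.0
    std = np.sqrt(n1 * n0 * (n1 + n0 + 1) / 12.0)
    return (w - mean) / std

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0] /= np.sqrt(2.0)
    return D

def pca(Z, n_components):
    """Project centered data onto its leading principal components via SVD."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T

def reduce_dimensions(X, y_obs, n_genes_kept, n_dct, n_pcs):
    """Assumed multi-step reduction: select genes, rank them, DCT, then PCA."""
    lab = y_obs >= 0                                   # scores use labeled samples only
    z = rank_sum_z(X[lab], y_obs[lab])
    order = np.argsort(-np.abs(z))[:n_genes_kept]      # genes ranked by |z|, best first
    coeffs = X[:, order] @ dct_matrix(n_genes_kept).T  # DCT along the ranked gene axis
    return pca(coeffs[:, :n_dct], n_pcs)               # keep low-frequency coefficients

def lgc_predict(F, y_obs, alpha=0.9):
    """Local and global consistency on a Gaussian-kernel graph (-1 marks unlabeled)."""
    n = F.shape[0]
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    sigma2 = np.median(np.sort(d2, axis=1)[:, 5])      # heuristic: 5th-neighbor scale
    W = np.exp(-d2 / (2.0 * sigma2))
    np.fill_diagonal(W, 0.0)
    dinv = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * dinv[:, None] * dinv[None, :]              # symmetric normalization
    classes = np.unique(y_obs[y_obs >= 0])
    Y = (y_obs[:, None] == classes[None, :]).astype(float)  # unlabeled rows stay zero
    scores = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return classes[scores.argmax(axis=1)]

# Synthetic stand-in for an expression matrix: 80 samples, 300 genes,
# of which the first 30 are shifted upward in class 1.
rng = np.random.default_rng(0)
n_per_class, n_genes, n_informative = 40, 300, 30
y_true = np.repeat([0, 1], n_per_class)
X = rng.normal(size=(2 * n_per_class, n_genes))
X[y_true == 1, :n_informative] += 2.5

# Only 10 samples per class keep their labels; the rest are treated as unlabeled.
y_obs = np.full(y_true.shape, -1)
labeled_idx = np.concatenate([np.arange(10), n_per_class + np.arange(10)])
y_obs[labeled_idx] = y_true[labeled_idx]

features = reduce_dimensions(X, y_obs, n_genes_kept=30, n_dct=20, n_pcs=3)
y_pred = lgc_predict(features, y_obs)
unlabeled = y_obs == -1
accuracy = float(np.mean(y_pred[unlabeled] == y_true[unlabeled]))
print(f"accuracy on unlabeled samples: {accuracy:.2f}")
```

In this sketch the gene scores are computed from the labeled samples only, while the DCT, PCA, and graph construction use all samples; that split is one plausible way to keep the setting semi-supervised and is not taken from the paper. On real expression data, ties would call for average ranks in the rank-sum statistic.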
