Selecting significant genes by randomization test for cancer classification using gene expression data

Gene selection is an important task in bioinformatics because the accuracy of cancer classification generally depends on identifying genes that are biologically relevant to the classification problem. In this work, a randomization test (RT) is used as a gene selection method for gene expression data. In this method, the significance of each gene is evaluated with a statistic derived from the statistics of the regression coefficients across a series of partial least squares discriminant analysis (PLSDA) models. Informative genes are selected for classifying four gene expression datasets, covering prostate cancer, lung cancer, leukemia and non-small cell lung cancer (NSCLC), and the rationality of the results is validated by multiple linear regression (MLR) modeling and principal component analysis (PCA). With the selected genes, satisfactory classification results are obtained.
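The abstract does not give the exact form of the RT statistic, so the following is only a minimal sketch of one common way to run a randomization test on PLS regression coefficients, not the authors' implementation. It assumes a two-class problem with labels coded 0/1, a PLSDA model reduced to a single latent variable for brevity, and a per-gene P value defined as the fraction of label-permuted models whose absolute coefficient is at least as large as that of the model fitted to the true labels. The function names, the synthetic data, and the 0.05 threshold are all hypothetical.

```python
import random


def pls1_coefficients(X, y):
    """Regression coefficients of a one-latent-variable PLS model of y on X."""
    n, p = len(X), len(X[0])
    # Mean-center each gene (column) and the class-label vector.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of covariance between the genes and the labels.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:
        return [0.0] * p
    w = [v / norm for v in w]
    # Sample scores, then the inner regression of the labels on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return [q * v for v in w]


def randomization_p_values(X, y, n_perm=200, seed=0):
    """Per-gene P values from models refitted on randomly permuted labels."""
    rng = random.Random(seed)
    b_true = pls1_coefficients(X, y)
    exceed = [0] * len(b_true)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any real gene/label relationship
        b_rand = pls1_coefficients(X, y_perm)
        for j, (br, bt) in enumerate(zip(b_rand, b_true)):
            if abs(br) >= abs(bt):
                exceed[j] += 1
    # Add-one correction keeps every P value strictly positive.
    return [(c + 1) / (n_perm + 1) for c in exceed]


def make_demo_data(n_per_class=10, n_genes=6, seed=1):
    """Synthetic data (hypothetical): gene 0 tracks the class, the rest are noise."""
    rng = random.Random(seed)
    y = [0] * n_per_class + [1] * n_per_class
    X = []
    for label in y:
        informative = 3.0 * label + rng.gauss(0.0, 0.3)
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_genes - 1)]
        X.append([informative] + noise)
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    p_values = randomization_p_values(X, y)
    for gene, p in enumerate(p_values):
        flag = "selected" if p < 0.05 else "rejected"
        print(f"gene {gene}: P = {p:.3f} ({flag})")
```

In practice the PLSDA models would use more than one latent variable and a numerical library; the single-component, pure-Python form is chosen here only to keep the permutation logic visible.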
