Improving accuracy for cancer classification with a new algorithm for genes selection

Background
Although the classification of cancer tissue samples from gene expression data has advanced considerably in recent years, improving accuracy remains a major challenge. One difficulty is establishing an effective method for selecting a parsimonious set of relevant genes. So far, most gene selection methods in the literature screen individual genes or pairs of genes without considering possible interactions among genes. Here we introduce a new computational method named the Binary Matrix Shuffling Filter (BMSF). It not only overcomes the difficulty associated with the search schemes of traditional wrapper methods and the overfitting problem in a high-dimensional search space, but also takes potential gene interactions into account during gene selection. Coupled with a Support Vector Machine (SVM) for implementation, the method often selects a very small number of genes, which makes the resulting model easy to interpret.

Results
We applied our method to 9 two-class gene expression datasets involving human cancers. During gene selection, the set of genes kept in the model was recursively refined and repeatedly updated according to the effect of a given gene on the contributions of other genes to cancer classification. The small number of informative genes selected from each dataset led to significantly improved leave-one-out cross-validation (LOOCV) classification accuracy across all 9 datasets for multiple classifiers. The selected genes also generalize broadly: multiple commonly used classifiers achieved LOOCV accuracy equivalent to or much higher than that reported in the literature.

Conclusions
A gene's contribution to binary cancer classification is better evaluated after adjusting for the joint effect of a large number of other genes. A computationally efficient search scheme was provided to search effectively the extensive feature space that includes possible interactions of many genes. The algorithm's performance on 9 datasets suggests that the accuracy of cancer classification can be improved by a large margin when the joint effects of many genes are considered.
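The leave-one-out evaluation of a selected gene subset can be illustrated with a small sketch. This is not the BMSF procedure and not the authors' code: the function names `nearest_centroid_predict` and `loocv_accuracy`, the toy expression matrix, and the nearest-centroid classifier (used here as a dependency-free stand-in for the SVM) are all assumptions made for illustration only.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Hypothetical stand-in classifier (NOT the SVM used in the paper).

    Predicts the label whose class centroid is closest to x.
    """
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        # Column-wise mean of the training rows belonging to this class.
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # sorted() makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda lab: sq_dist(centroids[lab], x))


def loocv_accuracy(samples, labels, genes, predict=nearest_centroid_predict):
    """Leave-one-out accuracy using only the gene columns listed in `genes`."""
    reduced = [[row[g] for g in genes] for row in samples]
    labels = list(labels)
    correct = 0
    for i in range(len(reduced)):
        # Hold out sample i, train on all remaining samples.
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, reduced[i]) == labels[i]:
            correct += 1
    return correct / len(reduced)


# Toy data (invented): 6 samples x 3 genes; only gene 0 separates the classes.
samples = [
    [1.0, 5.0, 2.0],
    [1.2, 1.0, 9.0],
    [0.9, 8.0, 4.0],
    [3.0, 4.9, 2.1],
    [3.1, 1.1, 8.8],
    [2.9, 8.2, 4.1],
]
labels = [0, 0, 0, 1, 1, 1]

acc_informative = loocv_accuracy(samples, labels, genes=[0])
acc_noise = loocv_accuracy(samples, labels, genes=[1, 2])
print(acc_informative, acc_noise)
```

On this toy matrix the single informative gene classifies every held-out sample correctly, while the two uninformative genes do not, which is the kind of comparison LOOCV accuracy is meant to expose. Any classifier with the same `predict(train_x, train_y, x)` shape could be passed in place of the stand-in.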
