A classification framework applied to cancer gene expression profiles.

Classification of cancer based on gene expression has provided insight into possible treatment strategies. Thus, developing machine learning methods that can successfully distinguish among cancer subtypes, or between normal and cancer samples, is important. This work discusses supervised learning techniques that have been employed to classify cancers. Furthermore, a two-step feature selection method based on an attribute estimation method (e.g., ReliefF) and a genetic algorithm was employed to find a set of genes that can best differentiate between cancer subtypes or between normal and cancer samples. Applying different classification methods (e.g., decision tree, k-nearest neighbor, support vector machine (SVM), bagging, and random forest) to five cancer datasets shows that no classification method universally outperforms all the others. However, k-nearest neighbor and linear SVM generally achieve better classification performance than the other classifiers. Finally, incorporating diverse types of genomic data (e.g., protein-protein interaction data and gene expression) increases the prediction accuracy compared with using gene expression alone.
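The two-step feature selection idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a synthetic two-class dataset, substitutes basic Relief (one nearest hit and one nearest miss) for ReliefF, uses leave-one-out 1-nearest-neighbor accuracy as the genetic algorithm's fitness, and all function names, parameters, and the size penalty of 0.01 per gene are hypothetical choices made for illustration.

```python
import random

def make_data(n_samples=40, n_genes=30, n_informative=2, shift=4.0, seed=0):
    """Synthetic two-class 'expression' matrix; only the first genes carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] += shift * label
        X.append(row)
        y.append(label)
    return X, y

def dist(a, b, genes):
    """Manhattan distance restricted to the given gene indices."""
    return sum(abs(a[g] - b[g]) for g in genes)

def relief_scores(X, y):
    """Basic Relief (a simplification of ReliefF): reward genes that differ
    on the nearest miss more than on the nearest hit."""
    n, m = len(X), len(X[0])
    genes = range(m)
    w = [0.0] * m
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j], genes))
        ms = min(misses, key=lambda j: dist(X[i], X[j], genes))
        for g in genes:
            w[g] += (abs(X[i][g] - X[ms][g]) - abs(X[i][g] - X[h][g])) / n
    return w

def loo_1nn_accuracy(X, y, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on chosen genes."""
    if not genes:
        return 0.0
    correct = 0
    for i in range(len(X)):
        nn = min((j for j in range(len(X)) if j != i),
                 key=lambda j: dist(X[i], X[j], genes))
        correct += y[nn] == y[i]
    return correct / len(X)

def ga_select(X, y, candidates, pop_size=12, generations=10, p_mut=0.1, seed=1):
    """Tiny genetic algorithm over bitmasks of the candidate genes."""
    rng = random.Random(seed)
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            genes = [g for g, bit in zip(candidates, mask) if bit]
            # Hypothetical fitness: accuracy minus a small penalty per gene.
            cache[key] = loo_1nn_accuracy(X, y, genes) - 0.01 * len(genes)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        new_pop = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(new_pop) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop, key=fitness)
    return [g for g, bit in zip(candidates, best) if bit]

# Step 1: filter genes by Relief score. Step 2: refine the subset with the GA.
X, y = make_data()
scores = relief_scores(X, y)
top = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:8]
selected = ga_select(X, y, top)
accuracy = loo_1nn_accuracy(X, y, selected)
print("filter stage kept genes:", top)
print("GA stage kept genes:", selected, "LOO accuracy:", accuracy)
```

Because genes are selected and accuracy is estimated on the same samples here, the reported accuracy is optimistic; a real evaluation would repeat the selection inside each cross-validation fold.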
