Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm

Background: In microarray data analysis, an important problem is how to select, from thousands of genes, a small number of informative genes that may contribute to the occurrence of cancers. Many researchers use various computational intelligence methods to analyze gene expression data.

Results: To select genes efficiently from thousands of candidates that can contribute to identifying cancers, this study develops a novel method that combines particle swarm optimization with a decision tree as the classifier. The study also compares the performance of the proposed method with other well-known benchmark classification methods (support vector machine, self-organizing map, back propagation neural network, C4.5 decision tree, Naive Bayes, CART decision tree, and artificial immune recognition system) in experiments on 11 gene expression cancer datasets.

Conclusion: Based on statistical analysis, our proposed method outperforms other popular classifiers on all test datasets and is comparable to SVM on certain specific datasets. Further, housekeeping genes with various expression patterns and tissue-specific genes are identified. These genes provide high discrimination power for cancer classification.
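The general idea of wrapping a swarm search around a tree classifier can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a binary PSO in which each particle is a bit mask over genes, it uses a depth-1 decision tree (a stump) scored on training accuracy as a stand-in for the decision tree classifier, and the names `stump_accuracy` and `binary_pso`, the parameter values, the gene-count penalty, and the toy data are all hypothetical choices made for illustration.

```python
import math
import random


def stump_accuracy(X, y, genes):
    """Best training accuracy of a depth-1 tree split on any one selected gene."""
    n = len(y)
    ones = sum(y)
    best = max(ones, n - ones) / n  # majority-class baseline
    for g in genes:
        values = sorted({row[g] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            # Rule: predict class 1 when expression > t (or the flipped rule).
            hits = sum((row[g] > t) == bool(label) for row, label in zip(X, y))
            best = max(best, hits / n, (n - hits) / n)
    return best


def binary_pso(X, y, n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5,
               penalty=0.1, seed=0):
    """Search for a gene subset; returns (selected gene indices, fitness)."""
    rng = random.Random(seed)
    n_genes = len(X[0])

    def fitness(bits):
        genes = [i for i, b in enumerate(bits) if b]
        if not genes:
            return 0.0
        # Assumed objective: accuracy minus a small penalty on subset size.
        return stump_accuracy(X, y, genes) - penalty * len(genes) / n_genes

    pos = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(n_particles)]
    vel = [[0.0] * n_genes for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_genes):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp velocity
                # Sigmoid of the velocity gives the probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return [i for i, b in enumerate(gbest) if b], gbest_fit


# Toy data (hypothetical): 40 samples, 6 genes; only gene 2 tracks the label.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in y]
for row, label in zip(X, y):
    row[2] = 2.0 * label + 0.5 * data_rng.random()

genes, fit = binary_pso(X, y)
print("selected genes:", genes, "fitness:", round(fit, 3))
```

In a real experiment, the stump would be replaced by a full decision tree, and fitness would be estimated with cross-validation rather than training accuracy, so that the search does not reward gene subsets that merely overfit the samples.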
