Gene selection and classification using Taguchi chaotic binary particle swarm optimization

The purpose of gene expression analysis is to discriminate between classes of samples and to predict the relative importance of each gene for sample classification. Microarray-based gene expression profiles have provided valuable results for a variety of problems and have contributed to advances in clinical medicine. Microarray data characteristically have a high dimension and a small sample size, which makes it difficult for a general classification method to classify samples correctly. However, not every gene is relevant for distinguishing the sample class. Thus, to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process, and an effective gene extraction method is necessary for eliminating irrelevant genes and decreasing the classification error rate. In this paper, correlation-based feature selection (CFS) and Taguchi chaotic binary particle swarm optimization (TCBPSO) were combined into a hybrid method. The K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) served as the classifier for ten gene expression profiles. Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The proposed method achieved the lowest classification error rate on all ten gene expression data sets tested, and for six of these data sets a classification error rate of zero was reached. The introduced method outperformed five other methods from the literature in terms of classification error rate. It could thus constitute a valuable tool for gene expression analysis in future studies.
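The evaluation step described above (scoring a candidate gene subset by the LOOCV error rate of a K-NN classifier) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `loocv_knn_error`, the Euclidean distance, the default `k=1`, the worst-case score for an empty subset, and the toy data are all assumptions made here, and the CFS and TCBPSO search stages are not shown.

```python
import math

def loocv_knn_error(samples, labels, mask, k=1):
    """LOOCV error rate of K-NN using only the genes where mask[j] == 1."""
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 1.0  # assumption: an empty gene subset is scored as worst case
    errors = 0
    for i, held_out in enumerate(samples):
        # Distance from the held-out sample to every other sample,
        # measured only on the selected genes (Euclidean is an assumption).
        dists = []
        for m, other in enumerate(samples):
            if m == i:
                continue
            d = math.sqrt(sum((held_out[j] - other[j]) ** 2 for j in selected))
            dists.append((d, labels[m]))
        dists.sort(key=lambda pair: pair[0])
        # Majority vote among the k nearest neighbours.
        votes = {}
        for _, label in dists[:k]:
            votes[label] = votes.get(label, 0) + 1
        predicted = max(votes, key=votes.get)
        if predicted != labels[i]:
            errors += 1
    return errors / len(samples)

# Hypothetical toy data: gene 0 separates the classes, gene 1 is misleading.
samples = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [1.1, 1.1]]
labels = ["A", "A", "B", "B"]

error_informative = loocv_knn_error(samples, labels, [1, 0])
error_misleading = loocv_knn_error(samples, labels, [0, 1])
print(error_informative, error_misleading)
```

In a wrapper-style search, a binary mask such as `[1, 0]` would play the role of one particle's position, and the returned error rate would be the fitness to minimize.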
