A novel divide-and-merge classification for high dimensional datasets

High-dimensional datasets can contain thousands of features, which makes classification computationally expensive. Such datasets therefore require a feature selection step before classification. The main idea behind feature selection is to choose a useful subset of features that improves the comprehensibility of the classifier and maximizes the performance of the classification algorithm. In this paper, we propose a one-per-class model for high-dimensional datasets. The proposed method extracts a different feature subset for each class in a dataset and applies the classification process to each of these subsets. Finally, it merges the prediction results across the subsets to determine the final class label of an unknown instance. The originality of the proposed model lies in using an appropriate feature subset for each class. To show the usefulness of the proposed approach, we developed an application method following the proposed model. Our results confirm that the method achieves higher classification accuracy than recent feature selection and classification methods.
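The divide-and-merge idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature selector (univariate F-test via scikit-learn's `SelectKBest`), the binary SVM base classifier, and the probability-based merge rule are all assumptions standing in for whatever the paper actually uses. The structure, however, matches the described model: one feature subset and one classifier per class, with predictions merged at the end.

```python
# Hedged sketch of a one-per-class divide-and-merge classifier.
# Selector, base classifier, and merge rule are illustrative choices,
# not the method from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def fit_one_per_class(X, y, k=20):
    """Divide step: for each class, pick a feature subset that separates
    that class from the rest, then train a binary one-vs-rest classifier
    on that subset only."""
    models = {}
    for c in np.unique(y):
        y_bin = (y == c).astype(int)                      # one-vs-rest labels
        selector = SelectKBest(f_classif, k=k).fit(X, y_bin)
        clf = SVC(probability=True, random_state=0)
        clf.fit(selector.transform(X), y_bin)
        models[c] = (selector, clf)
    return models

def predict_merge(models, X):
    """Merge step: each per-class model scores the instances on its own
    feature subset; the class whose model gives the highest score wins."""
    classes = sorted(models)
    scores = np.column_stack([
        models[c][1].predict_proba(models[c][0].transform(X))[:, 1]
        for c in classes
    ])
    return np.array(classes)[scores.argmax(axis=1)]

# Toy high-dimensional example: 200 features, only 20 informative, 3 classes.
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           n_classes=3, random_state=0)
models = fit_one_per_class(X, y, k=20)
acc = (predict_merge(models, X) == y).mean()
```

Because each binary classifier sees only the features relevant to its own class, the per-class subsets can differ, which is the point of the one-per-class design; a single shared subset, as in conventional feature selection, cannot do this.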
