Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors

OBJECTIVE We explore the link between dataset complexity, which determines how difficult a dataset is to classify, and classification performance, measured by the bolstered resubstitution error (a low-variance, low-bias error estimate) of k-nearest neighbor classifiers.

METHODS AND MATERIAL The task studied is gene expression based cancer classification. Six gene expression datasets covering different types of cancer serve as test data.

RESULTS Through extensive simulation, coupled with the copula method for analyzing association in bivariate data, we show that dataset complexity and bolstered resubstitution error are statistically dependent. Based on this, we propose a new scheme for generating ensembles of classifiers that selects low-complexity feature subsets for the ensemble members; according to the dependence relation found, such subsets yield accurate members.

CONCLUSION Experiments with the six gene expression datasets demonstrate that our ensemble generation scheme, based on the dependence between dataset complexity and classification error, outperforms both the single best classifier in the ensemble and the traditional ensemble construction scheme, which ignores dataset complexity.
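The ensemble generation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the complexity measure, the subset sampling, or the combination rule, so everything here is an assumption: complexity is scored with a hypothetical function based on the maximum Fisher discriminant ratio, candidate feature subsets are drawn at random, labels are binary (0/1), and members are combined by majority vote. All function names and parameters are illustrative.

```python
import numpy as np


def fisher_complexity(X, y):
    """Hypothetical complexity score (an assumption, not the paper's measure).

    Returns 1 / (1 + max Fisher discriminant ratio) over the features of X,
    so a lower value means the two classes are easier to separate.
    """
    a, b = X[y == 0], X[y == 1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0) + 1e-12  # avoid division by zero
    return 1.0 / (1.0 + np.max(num / den))


def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-nearest-neighbor prediction for binary 0/1 labels (k odd)."""
    # Squared Euclidean distance between every test and training sample.
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)


def low_complexity_ensemble(X, y, n_candidates=50, n_members=5,
                            subset_size=5, seed=0):
    """Draw random feature subsets and keep the least complex ones."""
    rng = np.random.default_rng(seed)
    candidates = [rng.choice(X.shape[1], size=subset_size, replace=False)
                  for _ in range(n_candidates)]
    candidates.sort(key=lambda idx: fisher_complexity(X[:, idx], y))
    return candidates[:n_members]


def ensemble_predict(subsets, X_train, y_train, X_test, k=3):
    """Majority vote over one k-NN classifier per feature subset."""
    votes = np.stack([knn_predict(X_train[:, s], y_train, X_test[:, s], k)
                      for s in subsets])
    return (votes.mean(axis=0) > 0.5).astype(int)


# Synthetic stand-in for a gene expression matrix (samples x genes).
rng = np.random.default_rng(42)
y_demo = np.array([0] * 30 + [1] * 30)
X_demo = rng.normal(size=(60, 200))
X_demo[y_demo == 1, :10] += 2.0  # only the first 10 "genes" are informative

members = low_complexity_ensemble(X_demo, y_demo)
pred = ensemble_predict(members, X_demo, y_demo, X_demo)
print("selected feature subsets:", [m.tolist() for m in members])
print("resubstitution accuracy:", (pred == y_demo).mean())
```

Using an odd `k` and an odd number of members avoids ties in both votes. The accuracy printed here is a plain resubstitution estimate on synthetic data; the bolstered resubstitution estimator used in the study is not reproduced in this sketch.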
